CN112561947A - Image self-adaptive motion estimation method and application - Google Patents

Image self-adaptive motion estimation method and application

Info

Publication number
CN112561947A
Authority
CN
China
Prior art keywords
image
neural network
convolutional neural
deep convolutional
adaptive
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011434819.0A
Other languages
Chinese (zh)
Inventor
杨德龙
尚鹏
侯增涛
王博
付威廉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN202011434819.0A
Publication of CN112561947A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/20: Analysis of motion
    • G06T 7/70: Determining position or orientation of objects or cameras
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods

Abstract

Existing algorithms do not consider the time interval between images in a sequence and assume that all images are acquired at the same moment, which introduces a certain error into the calculation. The application provides an image adaptive motion estimation method comprising: constructing a first deep convolutional neural network and a second deep convolutional neural network; constructing an objective function from the two networks and training them simultaneously through the objective function to obtain a first deep convolutional neural network with fixed parameters and a second deep convolutional neural network with fixed parameters; and inputting a monocular image into the first deep convolutional neural network to output the corresponding parallax image, and inputting an image sequence into the second deep convolutional neural network to output a camera spatial pose transformation matrix. The adverse effect of non-overlapping regions between images on image reconstruction is thereby avoided.

Description

Image self-adaptive motion estimation method and application
Technical Field
The application belongs to the technical field of artificial intelligence, and particularly relates to an image adaptive motion estimation method and application.
Background
Vision and hearing are the main ways in which humans perceive the external environment, and more than 80% of external information is obtained visually. Scene perception based on vision is a major challenge in the field of artificial intelligence and an important component of the visual navigation system of an unmanned vehicle. In such a navigation system, three-dimensional scene information (parameters such as the relative distance between the scene and the camera, and the spatial position and attitude of the camera) plays an important role. At the same time, the monocular camera has the advantages of small size, simple equipment, low cost and ease of deployment, giving it application advantages over other sensors. Therefore, research on monocular image motion estimation algorithms for unmanned driving scenes is of great significance for the development of visual navigation systems for unmanned vehicles.
Currently, monocular image motion estimation algorithms based on deep learning are divided into supervised and unsupervised learning algorithms. The training data set of a supervised learning algorithm consists of an input image sequence and a set of labels corresponding to each image. However, such label sets are mostly produced by manual annotation, which greatly limits the application range of these algorithms, and they are gradually being phased out. Unsupervised learning algorithms use the spatial geometric relationship between images to design a supervision signal that replaces the label set of the supervised algorithm; model training and testing can be completed using only images, and this has increasingly become the mainstream research direction.
In the design of the objective function, however, all images are treated by default as static images acquired at the same moment, and the acquisition time interval between images in the sequence is ignored, which inevitably introduces errors and reduces algorithm accuracy. Analysis of this problem shows that existing monocular image motion estimation methods based on unsupervised deep learning consider only the geometric relationship between images and ignore the time dimension. Although a short time interval does not cause the algorithm to fail, treating a dynamic scene image as a static scene image by default reduces the accuracy and robustness of the algorithm.
Disclosure of Invention
1. Technical problem to be solved
In monocular image depth estimation and camera spatial pose calculation methods based on unsupervised deep learning, a monocular image sequence is required as the training sample during model training. Existing algorithms do not consider the time interval between images in the sequence and assume that all images are acquired at the same moment, which introduces a certain error into the calculation.
2. Technical scheme
In order to achieve the above object, the present application provides an image adaptive motion estimation method comprising the following steps. Step 1: constructing a first deep convolutional neural network and a second deep convolutional neural network. Step 2: constructing an objective function according to the first and second deep convolutional neural networks, and simultaneously training the two networks through the objective function to obtain a first deep convolutional neural network with fixed parameters and a second deep convolutional neural network with fixed parameters. Step 3: inputting a monocular image into the first deep convolutional neural network to output the corresponding parallax image, and inputting an image sequence into the second deep convolutional neural network to output a camera spatial pose transformation matrix.
Another embodiment provided by the present application is: the first deep convolutional neural network is a monocular image depth of field estimation network used to estimate the relative distance between the monocular camera and the scene; the second deep convolutional neural network is a monocular camera spatial pose estimation network used to estimate the spatial position and attitude of the monocular camera.
Another embodiment provided by the present application is: the monocular image depth of field estimation network is based on a deep residual network and has an encoding-decoding structure.
Another embodiment provided by the present application is: in the encoding process, the network continuously extracts the desired high-dimensional features and performs down-sampling through convolutional layers, activation layers and pooling layers; in the decoding process, the network up-samples through deconvolution and outputs multi-scale parallax images.
Another embodiment provided by the present application is: the monocular camera space pose estimation network is of an encoding structure.
Another embodiment provided by the present application is: training the first and second deep convolutional neural networks consists of iteratively optimizing the objective function with a gradient descent method until a specified number of iterations is reached, yielding a first deep convolutional neural network with fixed parameters and a second deep convolutional neural network with fixed parameters.
Another embodiment provided by the present application is: the objective function comprises an adaptive function constructed from the global and local brightness differences of the images; an adaptive error loss function for the reconstructed images, constructed by reconstructing images within the monocular image sequence and combining the result with the adaptive function; and an adaptive loss function for image depth edges, also constructed in combination with the adaptive function.
Another embodiment provided by the present application is: the adaptive error loss function is constructed from the input image, the parallax image and the camera pose transformation matrix, and the adaptive loss function is likewise constructed from the input image, the parallax image and the camera pose transformation matrix.
Another embodiment provided by the present application is: the images include a target image and reference images, and the reference images include a first reference image and a second reference image.
The application also provides an application of the image self-adaptive motion estimation method, and the image self-adaptive motion estimation method is applied to an outdoor unmanned automobile or an unmanned autonomous navigation robot.
3. Advantageous effects
Compared with the prior art, the image self-adaptive motion estimation method and the application have the beneficial effects that:
the image self-adaptive motion estimation method is a monocular image self-adaptive motion estimation method based on unsupervised deep learning.
In the image adaptive motion estimation method provided by the application, an adaptive function is designed from the global and local brightness differences of the images and is used to treat overlapping and non-overlapping regions between images in a sequence differently. Applied to monocular image depth of field estimation and camera pose estimation, it effectively resolves the adverse effects caused by the time interval between images in the sequence.
In the image adaptive motion estimation method provided by the application, during construction of the objective function an adaptive function is designed based on the global and local brightness of the images and is used to distinguish overlapping from non-overlapping regions in an image sequence, thereby avoiding the adverse effect of non-overlapping regions between images on image reconstruction.
The application of the image adaptive motion estimation method addresses the problem of vehicle motion estimation (estimating the relative distance between a monocular camera and the scene, and the spatial position and attitude of the monocular camera) in an outdoor unmanned vehicle or an unmanned autonomous navigation robot, and provides an adaptive motion estimation method for it.
Drawings
FIG. 1 is a schematic diagram illustrating the principles of the image adaptive motion estimation method of the present application;
FIG. 2 is a graphical representation of the comparative experimental results of the present application on the KITTI dataset;
FIG. 3 is a graphical representation of the comparative experimental results of the present application on the Cityscapes dataset.
Detailed Description
Hereinafter, specific embodiments of the present application will be described in detail with reference to the accompanying drawings, so that those skilled in the art can practice the application from this detailed description. Features from different embodiments may be combined to yield new embodiments, or certain features may be substituted in certain embodiments to yield yet further preferred embodiments, without departing from the principles of the present application.
Referring to fig. 1 to 3, the present application provides an image adaptive motion estimation method comprising the following steps. Step 1: constructing a first deep convolutional neural network and a second deep convolutional neural network. Step 2: constructing an objective function according to the first and second deep convolutional neural networks, and simultaneously training the two networks through the objective function to obtain a first deep convolutional neural network with fixed parameters and a second deep convolutional neural network with fixed parameters. Step 3: inputting a monocular image into the first deep convolutional neural network to output the corresponding parallax image, and inputting an image sequence into the second deep convolutional neural network to output a camera spatial pose transformation matrix.
Step 1 and step 2 constitute the training process. In step 2, the parameters of the first and second deep convolutional neural networks are adjusted through the objective function, finally yielding a first deep convolutional neural network with fixed parameters and a second deep convolutional neural network with fixed parameters.
As shown in fig. 1, a training sample consists of a monocular image sequence of length 3, in which the second image is designated as the target image and the first and third images are designated as reference image 1 and reference image 2, respectively. Training samples are input simultaneously into the monocular image depth of field estimation network AdaDepthNet and the monocular camera spatial pose estimation network AdaMotionNet. Through the computation of these two deep convolutional neural networks, AdaDepthNet outputs the parallax images corresponding to the image sequence (when the camera parameters are known, the parallax image and the depth image can be freely converted into one another using the formula

Disparity(i, j) = f · b / Depth(i, j)

where Disparity denotes the parallax image; Depth denotes the depth image; (i, j) denotes pixel coordinates; f denotes the camera focal length; and b denotes the camera baseline), and AdaMotionNet outputs the camera spatial pose transformation matrices T_s1→t and T_s2→t from reference image 1 and reference image 2 to the target image. The loss functions L_ad_ph and L_ad_smooth in the objective function are all constructed from the input images, the parallax images and the camera pose transformation matrices.
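As an illustration of the disparity/depth relationship above, the following is a minimal sketch (not part of the patent text); the array shapes and the focal length and baseline values are hypothetical examples.

```python
import numpy as np

def disparity_to_depth(disparity, focal_length, baseline, eps=1e-6):
    """Convert a disparity map to a depth map: Depth = f * b / Disparity."""
    return focal_length * baseline / np.maximum(disparity, eps)

def depth_to_disparity(depth, focal_length, baseline, eps=1e-6):
    """Inverse conversion: Disparity = f * b / Depth."""
    return focal_length * baseline / np.maximum(depth, eps)

# Hypothetical example with KITTI-like intrinsics (values are illustrative only).
disparity = np.random.uniform(1.0, 50.0, size=(128, 416)).astype(np.float32)
depth = disparity_to_depth(disparity, focal_length=721.5, baseline=0.54)
```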
The method proposed by the application comprises two parts, training and testing. After the AdaDepthNet and AdaMotionNet network structures have been designed and the objective function constructed, the network parameters are still unknown; the objective function must be iteratively optimized with a gradient descent method until the specified number of iterations is reached. This is the training process of the method. After training is completed, the network parameters are fixed; a monocular image sequence can then be used as input data, and the networks directly output the corresponding parallax image or camera spatial pose transformation matrix. The accuracy of the results is determined directly by the objective function, so the design of the objective function is the core of the method.
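A minimal training-loop sketch of the procedure described above (gradient descent on the objective for a fixed number of iterations, then freezing the parameters). The tiny stub networks, the dummy objective and the dummy data below are placeholders assumed for illustration only; they are not the patent's implementation.

```python
import torch
import torch.nn as nn

# Stand-in stubs for the two networks; the real AdaDepthNet / AdaMotionNet
# structures are described in the text and Table 1 (these stubs only make
# the training skeleton below runnable).
depth_net = nn.Sequential(nn.Conv2d(3, 1, 3, padding=1), nn.Sigmoid())
pose_net = nn.Sequential(nn.Conv2d(9, 6, 3, padding=1), nn.AdaptiveAvgPool2d(1))

params = list(depth_net.parameters()) + list(pose_net.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

def objective_function(target, refs, disparity, pose):
    # Placeholder objective; the patent's actual objective is equation (10).
    return disparity.mean() + pose.abs().mean()

num_iterations = 100            # "specified number of iterations" (illustrative)
for step in range(num_iterations):
    target = torch.rand(4, 3, 128, 416)           # dummy target images
    ref1, ref2 = torch.rand_like(target), torch.rand_like(target)
    disparity = depth_net(target)
    pose = pose_net(torch.cat([ref1, target, ref2], dim=1))
    loss = objective_function(target, (ref1, ref2), disparity, pose)
    optimizer.zero_grad(); loss.backward(); optimizer.step()

depth_net.eval(); pose_net.eval()                 # parameters fixed for testing
```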
Further, the first deep convolutional neural network is a monocular image depth of field estimation network used to estimate the relative distance between the monocular camera and the scene; the second deep convolutional neural network is a monocular camera spatial pose estimation network used to estimate the spatial position and attitude of the monocular camera.
Further, the monocular image depth of field estimation network is based on a deep residual network and has an encoding-decoding structure.
Further, in the encoding process, the network continuously extracts the desired high-dimensional features and performs down-sampling through convolutional layers, activation layers and pooling layers; in the decoding process, the network up-samples through deconvolution and outputs multi-scale parallax images.
Further, the monocular camera spatial pose estimation network is of an encoding structure.
The AdaDepthNet included in the technical scheme of the application is designed on the basis of the deep residual network ResNet and has an encoding-decoding structure. In the encoding process, the network continuously extracts the desired high-dimensional features and performs down-sampling through convolutional layers, activation layers and pooling layers; in the decoding process, the network up-samples through deconvolution and outputs multi-scale parallax images with sizes (H, W), (H/2, W/2), (H/4, W/4) and (H/8, W/8), where H and W denote the height and width of the image, respectively. AdaMotionNet directly uses an encoding-only structure (the network structure is shown in Table 1, where conv1, conv2, ..., conv6 denote the output of each convolutional layer and Pose denotes the monocular camera spatial pose transformation matrix; a code sketch follows the table). A monocular image sequence is used as input, and successive convolutional layers compute the camera spatial pose transformation matrix that is finally output. The objective function is constructed from the input images, the parallax images output by AdaDepthNet and the camera spatial pose transformation matrices output by AdaMotionNet; a functional block diagram of the method is shown in FIG. 1.
Input data                 Output channels   Kernel size   Stride   Output
Monocular image sequence   16                7             2        conv1
conv1                      32                5             2        conv2
conv2                      64                5             2        conv3
conv3                      128               3             2        conv4
conv4                      256               3             2        conv5
conv5                      256               3             2        conv6
conv6                      48                1             1        Pose
Table 1  AdaMotionNet network structure
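The sketch below builds an encoder with the layer dimensions of Table 1. It assumes each row is a single stride-2 convolution followed by ReLU, that the input is a concatenated 3-frame RGB sequence (9 channels), and that the final 48-channel map is averaged over its spatial dimensions to produce the pose output; these assumptions go beyond what the table itself states.

```python
import torch
import torch.nn as nn

class PoseEncoderSketch(nn.Module):
    """Convolution stack following the dimensions listed in Table 1 (assumptions noted above)."""
    def __init__(self, in_channels=9):
        super().__init__()
        cfg = [  # (out_channels, kernel, stride) per Table 1
            (16, 7, 2), (32, 5, 2), (64, 5, 2),
            (128, 3, 2), (256, 3, 2), (256, 3, 2),
        ]
        layers, c = [], in_channels
        for out_c, k, s in cfg:
            layers += [nn.Conv2d(c, out_c, k, stride=s, padding=k // 2),
                       nn.ReLU(inplace=True)]
            c = out_c
        self.encoder = nn.Sequential(*layers)
        self.pose_conv = nn.Conv2d(256, 48, kernel_size=1, stride=1)  # conv6 -> Pose

    def forward(self, image_sequence):
        features = self.encoder(image_sequence)
        pose = self.pose_conv(features)
        return pose.mean(dim=[2, 3])  # (N, 48) pose representation

# Usage with a dummy 3-frame sequence stacked along the channel dimension.
seq = torch.rand(1, 9, 128, 416)
print(PoseEncoderSketch()(seq).shape)  # torch.Size([1, 48])
```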
Further, training the first and second deep convolutional neural networks consists of iteratively optimizing the objective function with a gradient descent method until the specified number of iterations is reached, yielding fixed network parameters.
Further, the objective function comprises an adaptive function constructed from the global and local brightness differences of the images; an adaptive error loss function for the reconstructed images, constructed by reconstructing images within the monocular image sequence and combining the result with the adaptive function; and an adaptive loss function for image depth edges, also constructed in combination with the adaptive function.
Further, the adaptive error loss function is constructed from the input image, the parallax image and the camera pose transformation matrix, and the adaptive loss function is likewise constructed from the input image, the parallax image and the camera pose transformation matrix.
1. Construction of adaptive functions
Let (I_1, I_2, I_3) denote a training sample and designate the second image I_2 as the target image I_t, with the first and third images as reference images. The construction principle of the objective function is to reconstruct the target image Î_t from the reference images and to use the degree of similarity between Î_t and I_t in place of a supervisory signal. However, because the image sequence is acquired continuously in time by a moving camera, non-overlapping regions inevitably exist between consecutive images, and direct calculation inevitably produces abnormal points or regions that reduce the accuracy of the algorithm.
In order to allow the model to judge whether a pixel in the image belongs to the overlapping region, the method provides an adaptive loss function based on image brightness consistency. Let I denote the input image and Î the reconstructed image, designate the entire image as the global region and a 5 × 5 pixel area as the local region. The global image brightness difference glo_pc is calculated as:

glo_pc = (1 / |Ω|) Σ_{(i,j)∈Ω} | I(i, j) - Î(i, j) |        (1)

where (i, j) represents image pixel coordinates, Ω represents the image region, and |Ω| represents the number of pixels in the image region.
If the parallax image output by AdaDepthNet and the camera spatial pose transformation matrix output by AdaMotionNet are accurate, the reconstruction result Î_t of the target image should be infinitely close to the input image I_t, that is, the brightness values of pixels with the same coordinates in the two images are equal or infinitely close to equal, and glo_pc → 0+. Conversely, if the depth of field estimation and the camera pose estimation are poor, the reconstruction of the target image differs greatly from the input image, and glo_pc is large and irregular. In theory, as network training progresses, glo_pc decreases and finally converges to zero from the positive side. The main reason is that glo_pc is the mean brightness difference over all pixels of the global image; the acquisition interval of the image sequence is short, so the overlap rate between the reference image and the target image is high, and after averaging the adverse effect of the non-overlapping region is clearly reduced. However, averaging only spreads the calculation error over each pixel and does not reduce the total error.
The local image brightness difference loc_pc is calculated as:

loc_pc(i, j) = (1 / |N(i, j)|) Σ_{(u,v)∈N(i,j)} | I(u, v) - Î(u, v) |        (2)

where N(i, j) denotes the 5 × 5 pixel neighborhood centered at (i, j). For the local image brightness difference loc_pc, a conclusion similar to that for glo_pc cannot be drawn. loc_pc computes the brightness difference of only a 5 × 5 pixel region; if the central pixel of the region happens to lie in a non-overlapping region, the image reconstruction algorithm cannot reconstruct that region, and even if the reconstruction accuracy of the overlapping region is high, the loc_pc corresponding to the non-overlapping region remains a random value greater than zero. Conversely, if the central pixel lies in the overlapping region and the image reconstruction accuracy is high, loc_pc → 0+.
Therefore, the adaptive weight function constructed by the method of the present application from the global and local brightness differences is:

ω(i, j) = exp( -( ε · loc_pc(i, j) + (1 - ε) · glo_pc(i, j) ) )        (3)

where loc_pc is the local brightness difference function, glo_pc is the global brightness difference function, ω(i, j) ∈ (0, 1) is the adaptive weight value, ε is a weight parameter, (i, j) represents pixel coordinates, and (i, j) ∈ Ω.
In the image reconstruction process, the adaptive weight function can be understood as a mask: a corresponding calculation rule is generated for each pixel to decide whether the pixel belongs to the overlapping region. ω(i, j) is a decreasing function of loc_pc(i, j). If a pixel p_temp lies in a non-overlapping region, the value of loc_pc(i, j) between the reconstructed image and the original input image at that pixel is large, and ω(p_temp) is small or even close to zero. The global image brightness difference term glo_pc(p_temp) does not change according to whether the pixel lies in a non-overlapping region, so the local image brightness difference term loc_pc plays the major role in the adaptive weight function.
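A minimal numerical sketch of equations (1)-(3) as written above, assuming the brightness differences are mean absolute differences; the value 0.85 used for ε and the uniform 5 × 5 window are illustrative assumptions.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def adaptive_weight(I, I_rec, eps_weight=0.85, window=5):
    """Per-pixel adaptive weight of equation (3), built from the global (1)
    and local (2) brightness differences (reconstructed forms, see text)."""
    abs_diff = np.abs(I - I_rec)
    glo_pc = abs_diff.mean()                          # eq. (1): global mean difference
    loc_pc = uniform_filter(abs_diff, size=window)    # eq. (2): 5x5 local mean difference
    return np.exp(-(eps_weight * loc_pc + (1.0 - eps_weight) * glo_pc))  # eq. (3)

# Toy example: a perfect reconstruction gives weights close to 1,
# a badly reconstructed (e.g. non-overlapping) strip gives weights near 0.
I = np.random.rand(64, 64).astype(np.float32)
I_rec = I.copy()
I_rec[:, 48:] = np.random.rand(64, 16)     # simulate a non-overlapping strip
w = adaptive_weight(I, I_rec)
print(w[:, :48].mean(), w[:, 48:].mean())  # overlapping vs non-overlapping weights
```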
2. Adaptive error loss function for reconstructed images
For a monocular image sequence (I_1, I_2, I_3), the reconstruction formula from a reference image to the target image is:

Î_{n→t}(p_t) = I_n( K · T_{s_n→t} · D_t(p_t) · K^{-1} · p_t )        (4)

where K represents the camera parameter matrix, which is a known quantity; T_{s_n→t} represents the camera spatial pose transformation matrix from the reference image to the target image; D_t represents the depth image corresponding to the target image (which can be converted from the parallax image); I_n represents a reference image; and n = 1, 3 corresponds to the first and third images.
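A minimal sketch of the inverse-warping view synthesis that equation (4) describes, assuming a pinhole camera model, a 4 × 4 pose matrix that maps target-frame points into the reference frame, and bilinear sampling; this is the standard formulation of such warping, not necessarily the patent's exact implementation.

```python
import torch
import torch.nn.functional as F

def reconstruct_target(ref_img, depth_t, K, T_t2ref):
    """Warp a reference image into the target view (sketch of eq. (4)).

    ref_img: (N,3,H,W), depth_t: (N,1,H,W), K: (N,3,3),
    T_t2ref: (N,4,4) pose assumed to map target-frame points to the reference frame.
    """
    N, _, H, W = ref_img.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    ones = torch.ones_like(xs)
    pix = torch.stack([xs, ys, ones], dim=0).float().view(1, 3, -1).expand(N, 3, -1)

    cam_pts = torch.linalg.inv(K) @ pix * depth_t.view(N, 1, -1)    # back-project with D_t
    cam_pts = torch.cat([cam_pts, torch.ones(N, 1, H * W)], dim=1)  # homogeneous coords
    ref_pts = K @ (T_t2ref @ cam_pts)[:, :3]                        # transform and project
    ref_pix = ref_pts[:, :2] / ref_pts[:, 2:3].clamp(min=1e-6)

    # Normalize to [-1, 1] for grid_sample (bilinear sampling of the reference image).
    grid_x = 2.0 * ref_pix[:, 0] / (W - 1) - 1.0
    grid_y = 2.0 * ref_pix[:, 1] / (H - 1) - 1.0
    grid = torch.stack([grid_x, grid_y], dim=-1).view(N, H, W, 2)
    return F.grid_sample(ref_img, grid, align_corners=True)

# Toy usage with identity pose and constant depth (identity warp).
img = torch.rand(1, 3, 64, 96)
depth = torch.ones(1, 1, 64, 96) * 5.0
K = torch.tensor([[[80.0, 0, 48], [0, 80.0, 32], [0, 0, 1]]])
print(reconstruct_target(img, depth, K, torch.eye(4).unsqueeze(0)).shape)
```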
The reconstruction results from the reference images to the target image obtained from equation (4) are Î_{1→t} and Î_{3→t}, respectively. Similar to the construction of the objective function above, the method selects the structural similarity function SSIM(·) and the L1 norm to construct the image brightness loss function:

L_ph = η · ( 1 - SSIM(I_t, Î_t) ) + (1 - η) · || I_t - Î_t ||_1        (5)
where η represents a weight value used to adjust the relative impact of the structural similarity function and the L1 norm on the result. The adaptive image brightness loss function is obtained from equations (3) and (5):

[Equation (6): the adaptive image brightness loss L_ad_ph, formed by combining the per-pixel brightness loss of equation (5) with the adaptive weight ω(i, j) of equation (3) and averaging over the image region Ω]
where (i, j) represents image pixel coordinates, Ω represents an image region, and | Ω | represents the number of pixels in the image region.
When the image reconstruction effect is poor, the global brightness difference glo_pc and the local brightness difference loc_pc between the reconstructed image and the input image are large, the values of 1/ω(i, j) become large, and the adaptive brightness loss function L_ad_ph cannot converge. For non-overlapping regions, the local brightness difference loc_pc increases, 1/ω(i, j) increases with loc_pc, and the loss function L_ad_ph likewise cannot converge. Only when the image reconstruction result is close to the original input image and the reconstructed pixels belong to the overlapping region does the loss function L_ad_ph converge.
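A minimal sketch of the SSIM + L1 brightness loss of equation (5) as written above. The 3 × 3 SSIM window and the η value of 0.85 are illustrative assumptions, and the combination with the adaptive weight ω of equation (3) to form equation (6) is omitted because its exact form is not recoverable from the text.

```python
import torch
import torch.nn.functional as F

def ssim_map(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM computed with 3x3 average-pooled local statistics."""
    mu_x, mu_y = F.avg_pool2d(x, 3, 1, 1), F.avg_pool2d(y, 3, 1, 1)
    sigma_x = F.avg_pool2d(x * x, 3, 1, 1) - mu_x ** 2
    sigma_y = F.avg_pool2d(y * y, 3, 1, 1) - mu_y ** 2
    sigma_xy = F.avg_pool2d(x * y, 3, 1, 1) - mu_x * mu_y
    num = (2 * mu_x * mu_y + c1) * (2 * sigma_xy + c2)
    den = (mu_x ** 2 + mu_y ** 2 + c1) * (sigma_x + sigma_y + c2)
    return num / den

def brightness_loss(target, reconstruction, eta=0.85):
    """Per-pixel brightness loss in the spirit of equation (5):
    eta weights the SSIM term against the L1 term."""
    ssim_term = (1.0 - ssim_map(target, reconstruction)).clamp(0, 2)
    l1_term = (target - reconstruction).abs()
    return eta * ssim_term + (1.0 - eta) * l1_term

t = torch.rand(1, 3, 64, 96)
r = torch.rand_like(t)
print(brightness_loss(t, r).mean())
```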
3. Adaptive edge loss function for depth images
Considering that the pixel brightness of the depth image changes abruptly in edge regions, the method provides an adaptive edge loss function analogous to the adaptive brightness loss function. Since image pixel brightness also changes abruptly at isolated noise points, the input image is first smoothed with the Gaussian kernel shown in equation (7), and the Laplace transform of the smoothed image is then used in the edge loss function.
G(i, j) = (1 / (2πσ²)) · exp( -(i² + j²) / (2σ²) )        (7)
"edge" may occur in any region of the image, so the method of the present application combines the adaptive function shown in equation (3) with the laplacian transform of the image to provide an adaptive function ω for the edge information of the depth imageedge
Figure BDA0002828079650000072
Wherein (i, j) represents image pixel coordinates; Ω represents an image area, | Ω | represents the number of pixels in the image area;
Figure BDA0002828079650000073
representing the laplacian operator. The adaptive edge loss function term L is obtained from equation (8)ad_smoothComprises the following steps:
Figure BDA0002828079650000074
where d represents a depth image. The adaptive edge loss function shown in equation (9) not only considers that the pixel brightness changes abruptly in the edge region, but also performs a differential calculation between the overlapping and non-overlapping regions in the two images.
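A minimal sketch of the Gaussian-smoothing plus Laplacian processing described above, using a standard edge-aware smoothness weighting in place of equations (8)-(9), whose exact forms are not recoverable from the text; the σ value and the exp(-|∇²I|) weighting are assumptions.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, laplace

def adaptive_edge_loss(depth, image, sigma=1.0):
    """Edge-aware depth smoothness sketch: smooth the input image with a
    Gaussian kernel (eq. (7)), take its Laplacian, and use it to down-weight
    the Laplacian of the depth map at image edges. The exp(-|lap|) weighting
    stands in for the patent's omega_edge of equation (8)."""
    smoothed = gaussian_filter(image, sigma=sigma)   # Gaussian smoothing, eq. (7)
    image_lap = np.abs(laplace(smoothed))            # noise-suppressed image Laplacian
    depth_lap = np.abs(laplace(depth))               # abrupt depth changes
    weight = np.exp(-image_lap)                      # assumed edge weighting
    return float((weight * depth_lap).mean())

image = np.random.rand(64, 96).astype(np.float32)
depth = np.random.rand(64, 96).astype(np.float32)
print(adaptive_edge_loss(depth, image))
```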
4. Objective function
Combining equations (6) and (9), the objective function of the adaptive depth of field estimation model is:

L_total = Σ_s ( τ1 · L_ad_ph^s + τ2 · L_ad_smooth^s )        (10)

where τ1 and τ2 represent weight values used to adjust the relative importance of the image reconstruction loss and the depth image edge loss, and s is the scale of the image.
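A minimal sketch of a multi-scale total objective of the form written in equation (10); the weight values and the way per-scale losses are supplied are illustrative assumptions.

```python
def total_objective(per_scale_losses, tau1=1.0, tau2=0.1):
    """Sum tau1 * L_ad_ph + tau2 * L_ad_smooth over image scales (eq. (10) sketch).

    per_scale_losses: list of (brightness_loss, edge_loss) tuples, one per
    output scale (H, W), (H/2, W/2), (H/4, W/4), (H/8, W/8).
    """
    return sum(tau1 * l_ph + tau2 * l_smooth for l_ph, l_smooth in per_scale_losses)

# Illustrative values for four scales.
print(total_objective([(0.31, 0.04), (0.28, 0.05), (0.25, 0.06), (0.22, 0.07)]))
```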
Further, the images include a target image and reference images, and the reference images include a first reference image and a second reference image.
The application also provides an application of the image self-adaptive motion estimation method, and the image self-adaptive motion estimation method is applied to an outdoor unmanned automobile or an unmanned autonomous navigation robot.
After the design of the objective function is completed, the objective function is iteratively optimized, i.e. the training process, using the open-source KITTI and Cityscapes datasets for outdoor unmanned driving scenes. During testing, the network parameters are fixed, and the scene depth information and camera spatial pose transformation matrix corresponding to a monocular image can be calculated directly.
FIG. 2(a) is an input image randomly selected from the KITTI dataset; FIG. 2(b) is the ground-truth depth image; FIG. 2(c) is the depth of field estimation result output by the SfMLearner model; FIG. 2(d) is the depth of field estimation result output by the GeoNet model; and FIG. 2(e) is the depth of field estimation result output by the adaptive depth of field estimation model proposed in the present application.
FIG. 3(a) is an input image randomly selected from the Cityscapes dataset; FIG. 3(b) is the ground-truth depth image; FIG. 3(c) is the depth of field estimation result output by the SfMLearner model; FIG. 3(d) is the depth of field estimation result output by the GeoNet model; and FIG. 3(e) is the depth of field estimation result output by the adaptive depth of field estimation model proposed in the present application.
The results of the accuracy comparison experiments are shown in Table 2. The test set consists of the 697 KITTI images of the (open-source) Eigen split, and the evaluation indexes are the Absolute Relative Error (AbsRel), Square Relative Error (SqRel), linear Root Mean Square Error (RMSE linear), logarithmic Root Mean Square Error (RMSE log) and accuracy (Correct). The first four error indexes evaluate the prediction error of the model: the lower the error value, the higher the prediction accuracy of the model. The last index evaluates the prediction accuracy of the model, which is proportional to the index value.
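A minimal sketch of the standard monocular depth evaluation metrics named above (AbsRel, SqRel, RMSE, RMSE log, and a threshold accuracy); the threshold 1.25 used for the accuracy term is the common convention and an assumption here.

```python
import numpy as np

def depth_metrics(gt, pred):
    """Standard monocular depth metrics (sketch): AbsRel, SqRel, RMSE, RMSE log, accuracy."""
    abs_rel = np.mean(np.abs(gt - pred) / gt)
    sq_rel = np.mean(((gt - pred) ** 2) / gt)
    rmse = np.sqrt(np.mean((gt - pred) ** 2))
    rmse_log = np.sqrt(np.mean((np.log(gt) - np.log(pred)) ** 2))
    ratio = np.maximum(gt / pred, pred / gt)
    correct = np.mean(ratio < 1.25)          # assumed delta < 1.25 accuracy threshold
    return abs_rel, sq_rel, rmse, rmse_log, correct

gt = np.random.uniform(1.0, 80.0, size=10000)
pred = gt * np.random.uniform(0.9, 1.1, size=10000)
print(depth_metrics(gt, pred))
```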
The baseline comparison models are: the supervised learning models Eigen and Liu; and the unsupervised learning models ACA, R. Garg, SfMLearner, GeoNet, GASDA and D-SLAM. AdaModel(K) and AdaModel(C) denote the adaptive depth of field estimation model trained on the KITTI dataset and the Cityscapes dataset, respectively, and AdaModel(C+K) denotes the adaptive depth of field estimation model trained on both the Cityscapes and KITTI datasets.
Table 2  Depth of field estimation comparison results [table image in the original]
The method selects the ORB-SLAM, SfMLearner, GeoNet and D-SLAM models as comparison algorithms and uses the absolute trajectory error (ATE) as the quantitative evaluation standard of model accuracy; the comparison results are shown in Table 3.
Model             Sequence 09      Sequence 10
ORB-SLAM (full)   0.014 ± 0.008    0.012 ± 0.011
ORB-SLAM (short)  0.064 ± 0.141    0.064 ± 0.130
SfMLearner        0.021 ± 0.017    0.020 ± 0.015
GeoNet            0.012 ± 0.007    0.012 ± 0.009
D-SLAM            0.017 ± 0.008    0.015 ± 0.017
AdaModel(K)       0.012 ± 0.005    0.012 ± 0.006
Table 3  Visual odometry comparison results (ATE)
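A minimal sketch of an absolute trajectory error computation in the spirit of the ATE metric reported in Table 3 (mean ± standard deviation of positional error after alignment); the centering and single-scale alignment used here are assumptions common for monocular methods, not necessarily the evaluation protocol used in the experiments.

```python
import numpy as np

def ate(gt_xyz, pred_xyz):
    """Absolute trajectory error sketch: align the predicted trajectory to the
    ground truth by translation and a single scale factor, then report the
    mean and standard deviation of the positional error."""
    gt = gt_xyz - gt_xyz.mean(axis=0)
    pred = pred_xyz - pred_xyz.mean(axis=0)
    scale = np.sum(gt * pred) / max(np.sum(pred * pred), 1e-12)  # monocular scale alignment
    err = np.linalg.norm(gt - scale * pred, axis=1)
    return err.mean(), err.std()

gt = np.cumsum(np.random.randn(100, 3) * 0.1, axis=0)
pred = gt * 1.3 + np.random.randn(100, 3) * 0.02   # scaled, noisy estimate
print(ate(gt, pred))
```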
The method of the application comprises a deep convolutional neural network AdaDepthNet for estimating the relative distance between the monocular camera and the scene, a deep convolutional neural network AdaMotionNet for estimating the spatial position and attitude of the monocular camera, an adaptive function for distinguishing overlapping from non-overlapping image regions, and the objective function of the adaptive motion estimation method.
Although the present application has been described above with reference to specific embodiments, those skilled in the art will recognize that many changes may be made in the configuration and details of the present application within the principles and scope of the present application. The scope of protection of the application is determined by the appended claims, and all changes that come within the meaning and range of equivalency of the technical features are intended to be embraced therein.

Claims (10)

1. An image adaptive motion estimation method, characterized by: the method comprises the following steps:
step 1: constructing a first deep convolutional neural network and a second deep convolutional neural network;
step 2: constructing an objective function according to the first deep convolutional neural network and the second deep convolutional neural network, and simultaneously training the first deep convolutional neural network and the second deep convolutional neural network through the objective function to obtain a first deep convolutional neural network with fixed parameters and a second deep convolutional neural network with fixed parameters;
and step 3: and inputting the monocular image into the first depth convolution neural network to output a parallax image corresponding to the monocular image, and inputting the image sequence into the second depth convolution neural network to output a camera space pose transformation matrix.
2. The image adaptive motion estimation method of claim 1, characterized in that: the first depth convolution neural network is a monocular image depth of field estimation network and is used for estimating the relative distance between a monocular camera and a scene; the second depth convolution neural network is a monocular camera space pose estimation network and is used for estimating the monocular camera space position and posture.
3. The image adaptive motion estimation method of claim 2, wherein: the monocular image depth of field estimation network is based on a depth residual error network, and the monocular image depth of field estimation network is of an encoding-decoding structure.
4. The image adaptive motion estimation method according to claim 3, characterized in that: in the encoding process, the network continuously extracts the desired high-dimensional features and performs down-sampling through convolutional layers, activation layers and pooling layers; in the decoding process, the network up-samples through deconvolution and outputs multi-scale parallax images.
5. The image adaptive motion estimation method of claim 2, wherein: the monocular camera space pose estimation network is of an encoding structure.
6. The image adaptive motion estimation method of claim 1, characterized in that: training the first deep convolutional neural network and the second deep convolutional neural network comprises iteratively calculating the objective function by a gradient descent method until a specified number of iterations is reached, to obtain the first deep convolutional neural network with fixed parameters and the second deep convolutional neural network with fixed parameters.
7. The image adaptive motion estimation method according to claim 6, characterized in that: the objective function comprises an adaptive function constructed from the global and local brightness differences of the images; an adaptive error loss function for the reconstructed images, constructed by reconstructing images within the monocular image sequence and combining the result with the adaptive function; and an adaptive loss function for image depth edges, constructed in combination with the adaptive function.
8. The image adaptive motion estimation method according to claim 7, characterized in that: the adaptive error loss function is constructed from the input image, the parallax image and the camera pose transformation matrix, and the adaptive loss function is constructed from the input image, the parallax image and the camera pose transformation matrix.
9. An image adaptive motion estimation method according to any one of claims 1 to 8, characterized by: the images include a target image and a reference image, and the reference image includes a first reference image and a second reference image.
10. An application of an image adaptive motion estimation method is characterized in that: the image adaptive motion estimation method of any one of claims 1-9 is applied to an outdoor unmanned automobile or an unmanned autonomous navigation robot.
CN202011434819.0A 2020-12-10 2020-12-10 Image self-adaptive motion estimation method and application Pending CN112561947A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011434819.0A CN112561947A (en) 2020-12-10 2020-12-10 Image self-adaptive motion estimation method and application

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011434819.0A CN112561947A (en) 2020-12-10 2020-12-10 Image self-adaptive motion estimation method and application

Publications (1)

Publication Number Publication Date
CN112561947A true CN112561947A (en) 2021-03-26

Family

ID=75060328

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011434819.0A Pending CN112561947A (en) 2020-12-10 2020-12-10 Image self-adaptive motion estimation method and application

Country Status (1)

Country Link
CN (1) CN112561947A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361544A (en) * 2021-06-11 2021-09-07 广东省大湾区集成电路与系统应用研究院 Image acquisition equipment, method and device for correcting external parameters of image acquisition equipment and storage medium
CN114782911A (en) * 2022-06-20 2022-07-22 小米汽车科技有限公司 Image processing method, device, equipment, medium, chip and vehicle

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110009674A (en) * 2019-04-01 2019-07-12 厦门大学 Monocular image depth of field real-time computing technique based on unsupervised deep learning
CN110503680A (en) * 2019-08-29 2019-11-26 大连海事大学 It is a kind of based on non-supervisory convolutional neural networks monocular scene depth estimation method
WO2019223382A1 (en) * 2018-05-22 2019-11-28 深圳市商汤科技有限公司 Method for estimating monocular depth, apparatus and device therefor, and storage medium
CN110770758A (en) * 2017-01-23 2020-02-07 牛津大学创新有限公司 Determining the position of a mobile device
CN110910447A (en) * 2019-10-31 2020-03-24 北京工业大学 Visual odometer method based on dynamic and static scene separation
CN111168722A (en) * 2019-12-12 2020-05-19 中国科学院深圳先进技术研究院 Robot following system and method based on monocular camera ranging
CN111386550A (en) * 2017-11-15 2020-07-07 谷歌有限责任公司 Unsupervised learning of image depth and ego-motion predictive neural networks
CN111771135A (en) * 2019-01-30 2020-10-13 百度时代网络技术(北京)有限公司 LIDAR positioning using RNN and LSTM for time smoothing in autonomous vehicles

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110770758A (en) * 2017-01-23 2020-02-07 牛津大学创新有限公司 Determining the position of a mobile device
CN111386550A (en) * 2017-11-15 2020-07-07 谷歌有限责任公司 Unsupervised learning of image depth and ego-motion predictive neural networks
WO2019223382A1 (en) * 2018-05-22 2019-11-28 深圳市商汤科技有限公司 Method for estimating monocular depth, apparatus and device therefor, and storage medium
CN111771135A (en) * 2019-01-30 2020-10-13 百度时代网络技术(北京)有限公司 LIDAR positioning using RNN and LSTM for time smoothing in autonomous vehicles
CN110009674A (en) * 2019-04-01 2019-07-12 厦门大学 Monocular image depth of field real-time computing technique based on unsupervised deep learning
CN110503680A (en) * 2019-08-29 2019-11-26 大连海事大学 It is a kind of based on non-supervisory convolutional neural networks monocular scene depth estimation method
CN110910447A (en) * 2019-10-31 2020-03-24 北京工业大学 Visual odometer method based on dynamic and static scene separation
CN111168722A (en) * 2019-12-12 2020-05-19 中国科学院深圳先进技术研究院 Robot following system and method based on monocular camera ranging

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
D. YANG ET AL: "An Adaptive Unsupervised Learning Framework for Monocular Depth Estimation", IEEE ACCESS, vol. 7, pages 148142 - 148151, XP011751392, DOI: 10.1109/ACCESS.2019.2946323 *
DELONG YANG, ET AL.: "Unsupervised framework for depth estimation and camera motion prediction from video", NEUROCOMPUTING, vol. 385, pages 169 - 185, XP086067819, DOI: 10.1016/j.neucom.2019.12.049 *
YANG, DL ,ET AL.: "Unsupervised learning of depth estimation, camera motion prediction and dynamic object localization from video", INTERNATIONAL JOURNAL OF ADVANCED ROBOTIC SYSTEMS, pages 1 - 14 *
ZHOU ZUDE, MENG WEI, CHEN BING: "Intelligent Control of Digital Manufacturing Equipment and Processes" (数字制造装备与过程的智能控制), Wuhan University of Technology Press, pages: 175 - 176 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113361544A (en) * 2021-06-11 2021-09-07 广东省大湾区集成电路与系统应用研究院 Image acquisition equipment, method and device for correcting external parameters of image acquisition equipment and storage medium
CN113361544B (en) * 2021-06-11 2024-04-19 广东省大湾区集成电路与系统应用研究院 Image acquisition equipment, and external parameter correction method, device and storage medium thereof
CN114782911A (en) * 2022-06-20 2022-07-22 小米汽车科技有限公司 Image processing method, device, equipment, medium, chip and vehicle
CN114782911B (en) * 2022-06-20 2022-09-16 小米汽车科技有限公司 Image processing method, device, equipment, medium, chip and vehicle

Similar Documents

Publication Publication Date Title
CN110009674B (en) Monocular image depth of field real-time calculation method based on unsupervised depth learning
CN108921926B (en) End-to-end three-dimensional face reconstruction method based on single image
CN113330490B (en) Three-dimensional (3D) assisted personalized home object detection
US10885659B2 (en) Object pose estimating method and apparatus
CN110503680B (en) Unsupervised convolutional neural network-based monocular scene depth estimation method
CN108242079B (en) VSLAM method based on multi-feature visual odometer and graph optimization model
Maggio et al. Loc-nerf: Monte carlo localization using neural radiance fields
CN110675423A (en) Unmanned aerial vehicle tracking method based on twin neural network and attention model
CN111325797A (en) Pose estimation method based on self-supervision learning
CN111902826A (en) Positioning, mapping and network training
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
CN108171249B (en) RGBD data-based local descriptor learning method
CN110443849B (en) Target positioning method for double-current convolution neural network regression learning based on depth image
CN112561947A (en) Image self-adaptive motion estimation method and application
CN112489119B (en) Monocular vision positioning method for enhancing reliability
CN113724379B (en) Three-dimensional reconstruction method and device for fusing image and laser point cloud
CN113762358A (en) Semi-supervised learning three-dimensional reconstruction method based on relative deep training
CN115187638B (en) Unsupervised monocular depth estimation method based on optical flow mask
CN112084934A (en) Behavior identification method based on two-channel depth separable convolution of skeletal data
CN113962858A (en) Multi-view depth acquisition method
CN111376273A (en) Brain-like inspired robot cognitive map construction method
CN111833400B (en) Camera pose positioning method
CN114663502A (en) Object posture estimation and image processing method and related equipment
CN113570658A (en) Monocular video depth estimation method based on depth convolutional network
CN114170290A (en) Image processing method and related equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination