CN115482257A - Motion estimation method integrating deep learning characteristic optical flow and binocular vision - Google Patents

Motion estimation method integrating deep learning characteristic optical flow and binocular vision

Info

Publication number
CN115482257A
Authority
CN
China
Prior art keywords
optical flow
image
deep learning
pixel
point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211149943.1A
Other languages
Chinese (zh)
Inventor
关乐
张天琦
王鑫阳
王珍
张志新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University
Original Assignee
Dalian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University
Priority to CN202211149943.1A
Publication of CN115482257A
Legal status: Pending (current)

Classifications

    • G06T 7/246: Image analysis; Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06N 3/08: Computing arrangements based on biological models; Neural networks; Learning methods
    • G06T 5/40: Image enhancement or restoration using histogram techniques
    • G06T 7/269: Analysis of motion using gradient-based methods
    • G06T 2207/10004: Image acquisition modality; Still image; Photographic image
    • G06T 2207/10012: Stereo images
    • G06T 2207/20081: Special algorithmic details; Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a motion estimation method that fuses deep-learning feature optical flow with binocular vision, comprising the following steps: performing contrast-limited adaptive histogram equalization preprocessing on a driving image dataset; constructing a deep-learning-based optical flow feature extraction model and training it to recognize moving targets; ranging with a binocular camera to obtain the position of the target object; and obtaining the motion speed of the vehicle body. Compared with traditional optical-flow speed measurement, the method, based on deep-learning optical flow and the binocular imaging principle, estimates the motion parameters of carrier displacement and speed directly from video data. It addresses the excessive sensitivity and unstable estimates of traditional optical flow estimation when the vehicle is driven at night, i.e. in low-light environments, and thus further improves reliability. It also avoids the accumulated error of traditional inertial sensors as well as the poor anti-interference capability and low update frequency of GPS-based positioning and speed measurement.

Description

Motion estimation method integrating deep learning characteristic optical flow and binocular vision
Technical Field
The invention relates to the technical field of unmanned driving, in particular to a motion estimation method integrating deep learning characteristic optical flow and binocular vision.
Background
Optical flow is the apparent visual motion between an observer and the observed scene, and carries motion information such as the surfaces and edges of objects in that scene. It can be regarded as the projection of three-dimensional motion onto a two-dimensional plane. Because optical flow contains rich motion and three-dimensional structure information, and offers high robustness, good real-time performance, low cost and no error accumulation, it can be applied to fields such as unmanned driving.
The existing mainstream speed-measurement approaches rely mainly on onboard inertial sensors, GPS positioning, or a hybrid of the two. Inertial sensors have excellent real-time performance but accumulate error over time; GPS positioning and speed measurement is accurate but has a low update rate and poor real-time performance, and is easily disturbed by signal interference. When a vehicle must drive for a long time without a signal, speed can instead be measured with the help of ubiquitous optical flow information.
With the continuous development of computer technology, learning with artificial neural networks (deep learning) can further improve processing accuracy and speed and extract richer information from the data. Unsupervised algorithms do not need ground-truth optical flow images as training samples and train the network directly on real scenes; in addition, they typically replace the end-to-end error loss of supervised learning with brightness-constancy and smoothness terms. At present, artificial neural networks can estimate the optical flow of a moving object quickly and accurately, but optical flow in low-light environments is still unstable. Most existing visual odometers are monocular visual odometers, visual-inertial odometers and the like, and all suffer from low estimation accuracy, high requirements on ambient illumination, and poor stability.
Disclosure of Invention
The invention aims to provide a motion estimation method that fuses deep-learning feature optical flow with binocular vision, which solves the weak optical flow estimation capability and low accuracy of traditional algorithms in low-light environments, avoids the accumulated error and poor real-time performance of traditional motion estimation methods, and provides a new estimation approach for unmanned driving.
In order to achieve the above object, the present application provides a motion estimation method fusing deep learning feature optical flow and binocular vision, comprising:
performing contrast-limited adaptive histogram equalization preprocessing on a driving image dataset;
constructing an optical flow feature extraction model based on deep learning and training it to recognize moving targets;
ranging with a binocular camera to obtain the position of the target object; and
obtaining the motion speed of the vehicle body.
Further, the driving image dataset is subjected to contrast-limited adaptive histogram equalization preprocessing, specifically: the original driving image is scaled to a set resolution, contrast-limited adaptive histogram equalization is applied, and the histogram is clipped at a preset value to limit the amplification intensity, giving the neighborhood cumulative distribution function mapping:
h(v) = round( (cdf(v) − cdf_min) / (M×N − cdf_min) × (G_i − 1) )
wherein cdf_min is the minimum value of the cumulative distribution function of the pixel values, M×N is the number of driving-image pixels, and G_i is the number of gray levels.
Further, an optical flow feature extraction model based on deep learning is constructed and trained to recognize the moving target, specifically:
designing local smoothness assumptions according to the optical flow characteristics to obtain the optical flow equation:
∂I/∂x·dx + ∂I/∂y·dy + ∂I/∂t·dt = 0, i.e. I(x, y, t) = I(x + Δx, y + Δy, t + Δt),
wherein x is the pixel abscissa, y is the pixel ordinate and t is time; dx, dy and dt are the differentials of x, y and t; I is the optical flow image information; ∂ is the partial differential operator; and Δx, Δy, Δt are the changes in x, y and t;
constructing two weight-sharing CNN layers to extract the features of the driving images;
computing the inner products of the feature pairs of the two driving images: the features f_1 ∈ R^(H×W×D) and f_2 ∈ R^(H×W×D) represent the driving images I_1 and I_2 respectively, and the visual similarity is obtained as the pairwise inner product of the feature vectors, expressed as:
C(f_1, f_2)_(ijkl) = Σ_d f_1(i, j, d)·f_2(k, l, d),
wherein C(f_1, f_2) ∈ R^(H×W×H×W); ij and kl are the positions of the optical flow point in the first and second frames respectively; d is the image channel index with range [0, D−1]; C is the four-dimensional vector feature; H and W are the image resolution and D is the number of channels;
constructing a pyramid to perform a pooling operation on the four-dimensional vector features;
obtaining the four-dimensional vector features of the high-resolution driving image: because of the data cost between pyramid levels, the optical flow correspondence point between the two frames is recorded as x' = (u + f_1(u), v + f_2(v)), where u is the pixel abscissa, v is the pixel ordinate, f_1 is the optical flow feature of the first frame image and f_2 is the optical flow feature of the second frame image; the neighborhood grid is
N(x')_r = { x' + dx | dx ∈ Z², ||dx||_1 ≤ r },
m being the number of pyramid levels; then, by looking up
N(x'/2^k)_r
the position corresponding to the optical flow is found on each level, wherein k may be any real number (non-integer positions are evaluated by bilinear interpolation); according to this correspondence, the four-dimensional vector feature of the high-resolution driving image is expressed as C^m(f_1, f_2)(i, j, p, q),
wherein m is the m-th level of the pyramid, and p and q are the row and column of the optical flow point in the pixel matrix of the m-th level;
the CNN layer iteratively updates the driving image data: given the current optical flow state f_k, each iteration generates a residual optical flow relative to the output of the previous iteration, i.e. an update value Δf, so the next optical flow prediction is f_{k+1} = f_k + Δf; the update is
Z_t = σ(W_z·[H_{t−1}, X_t])
R_t = σ(W_r·[H_{t−1}, X_t])
H_cand = tanh(W·[R_t ⊙ H_{t−1}, X_t])
H_t = (1 − Z_t) ⊙ H_{t−1} + Z_t ⊙ H_cand
wherein R_t is the reset gate, Z_t is the update gate, σ is the activation function, H_t is the amount of information retained from the hidden state of the previous stage, H_{t−1} is the hidden layer, X_t is the optical flow input value, W_r and W_z are weight information matrices, and W is the candidate-state weight matrix;
after the four-dimensional vector features of the high-resolution driving image (the motion-enhanced moving object image) are obtained, the pixel information of the optical flow is tracked at the original resolution of the pyramid to obtain the moving target's activity area, in which the minimum image displacement is v = [v_x, v_y]^T; the matching error ε(v) over the neighborhood of each point is
ε(v) = Σ_{x = p_x − w_x}^{p_x + w_x} Σ_{y = p_y − w_y}^{p_y + w_y} ( A(x, y) − B(x + v_x, y + v_y) )²,
wherein v_x and v_y are the horizontal and vertical displacements at the top pyramid level; p_x is the abscissa of the optical flow point and w_x the abscissa neighborhood range; p_y is the ordinate of the optical flow point and w_y the ordinate neighborhood range; A(x, y) is the first-frame optical flow feature and B(x, y) is the second-frame optical flow feature;
carrying out recognition training on the moving target object in the activity area;
a supervised algorithm is selected for the model, with the loss function set as
L = Σ_{i=1}^{N} γ^(N−i)·|| f_gt − f_i ||_1,
i.e. the L1 norm between each iteration result and the true value, where N is the number of iterations and γ = 0.8; f_gt is the ground-truth optical flow feature, f_i is the optical flow feature of the i-th iteration; Δx_gt and Δy_gt are the horizontal and vertical displacements of the ground-truth flow, and Δx_i and Δy_i are the horizontal and vertical displacement coordinates of the estimated flow.
Further, the CNN comprises six residual layers in total, two at 1/2 resolution, two at 1/4 resolution and two at 1/8 resolution, with the number of channels increasing each time the resolution between residual layers is halved; when extracting features, two consecutive frames are input, giving a mapping R^(H×W×3) → R^(H×W×D), where H and W are the image resolution and D is the number of channels.
Further, a pyramid is constructed to perform the pooling operation on the four-dimensional vector features, specifically:
three levels of similarity pyramids of the driving image are constructed, with kernels 1, 2 and 4 from level one to level three, and the last two dimensions of the four-dimensional vector features are pooled; the driving image pyramid is expressed as
I^L(x, y) = 1/4·I^(L−1)(2x, 2y) + 1/8·[ I^(L−1)(2x−1, 2y) + I^(L−1)(2x+1, 2y) + I^(L−1)(2x, 2y−1) + I^(L−1)(2x, 2y+1) ] + 1/16·[ I^(L−1)(2x−1, 2y−1) + I^(L−1)(2x+1, 2y−1) + I^(L−1)(2x−1, 2y+1) + I^(L−1)(2x+1, 2y+1) ],
wherein L is the pyramid level, I^L is the image of level L, and x, y are the pixel coordinates of the optical flow point.
Furthermore, ranging is performed with the binocular camera to obtain the position of the target, specifically:
obtaining the distortion description from the parameters and distortion coefficients of the left and right cameras;
applying a projective transformation to the left and right images shot of the same scene at the same time, i.e. computing a rectification mapping table and remapping with it;
finding corresponding points in the left and right images by semi-global matching (SGM) in stereo matching to obtain the disparity, disparity = u_l − u_r, where u_l and u_r are the column coordinates of the corresponding target point in the left and right images, generating a disparity map, which is then filtered with a Guided Filter;
calculating the depth of the target:
depth = f·b / (d − (c_xl − c_xr)),
where f is the focal length, b is the baseline length, d is the disparity, and c_xl, c_xr are the column coordinates of the two cameras' principal points;
constructing a 3D space from the set of disparity maps to obtain the three-dimensional coordinates of the pixel points:
[X Y Z W]^T = Q·[g h disparity(g, h) 1]^T
3DImage(g, h) = (X/W, Y/W, Z/W)
wherein X, Y, Z and W are the four-dimensional homogeneous coordinates obtained after the matrix transformation; g, h are the position of each pixel; Q is the perspective transformation matrix; and 3DImage(g, h) gives the (x, y, z) coordinates in the viewing coordinate system.
Further, the distortion disparity(g, h) is described by the standard radial, tangential and thin-prism lens distortion model,
wherein s_1, s_2, s_3 are the thin-prism distortion coefficients, k_1, k_2, k_3, k_4, k_5, k_6 are the radial distortion coefficients, p_1, p_2 are the tangential distortion coefficients, and r is the distortion radius.
Furthermore, the two images shot of the same scene at the same time are projectively transformed, specifically:
the two image planes are made parallel to the baseline, and the same target point lies on the same horizontal line in the left and right images, i.e. coplanar rows are aligned;
a point P in space is projected through the left and right camera lenses C_1, C_2 by the projection matrix equations
Z_c1·[u_1, v_1, 1]^T = M^(1)·[X, Y, Z, 1]^T
Z_c2·[u_2, v_2, 1]^T = M^(2)·[X, Y, Z, 1]^T
wherein (u_1, v_1, 1) and (u_2, v_2, 1) are the homogeneous coordinates of x_1 and x_2 in the respective images; (X, Y, Z, 1) is the homogeneous coordinate of the point P in world coordinates; and m_ij is the element in row i and column j of the projection matrix M;
when the point is extended to a line element, let c_1 and c_2 be the images in the left and right cameras of the same straight line S in space; the parametric equations of S in the space coordinate system are written and substituted into the projection matrices, giving the mapping equations of the left and right cameras in the projection plane, from which the rectification mapping is obtained.
further, the vehicle body movement speed is obtained, specifically:
Figure BDA0003856610570000076
wherein d is k 、d k-1 And the Z-axis value is the three-dimensional coordinate value of the pixel point of the current frame and the pixel point of the previous frame.
Compared with the prior art, the technical scheme adopted by the invention has the following advantages: the method is based on deep-learning optical flow and the binocular imaging principle and can estimate the motion parameters of vehicle-body displacement and speed from video data, so motion estimation remains possible when the vehicle is driven at night, further improving reliability. At the same time, the method avoids the poor anti-interference capability and low update frequency of traditional approaches that rely on inertial sensors or on GPS positioning and speed measurement; it is highly portable, runs in real time, is robust, accumulates no error and is low-cost, providing a new motion estimation approach for unmanned driving.
Drawings
FIG. 1 is a hardware schematic involved in the embodiment;
FIG. 2 is a schematic block diagram of a motion estimation method;
FIG. 3 is a flow chart of a motion estimation method;
FIG. 4 is a view of the structure of an optical flow model;
FIG. 5 is a diagram of an optical flow model implementation;
FIG. 6 is a schematic diagram of tangential distortion formation during calibration and correction;
fig. 7 is a schematic diagram of coplanar row alignment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the application, i.e., the embodiments described are only a subset of, and not all embodiments of the application.
Example 1
The method determines road-environment features from neural-network-enhanced optical flow information, identifies stationary objects on both sides of the road, determines the distance between the carrier and the target object, and then obtains the speed from the change of that distance between two frames and the elapsed time. The camera and the video velocimeter are physically independent of each other, as shown in fig. 1. The schematic block diagram is shown in fig. 2; the motion estimation system implementing the method includes a video acquisition unit, a frame-by-frame recording unit, a storage unit, a deep-learning intelligent motion enhancement unit, an optical-flow-feature target identification unit and a motion analysis unit, wherein the motion enhancement unit and the target identification unit are packaged together as a moving-target calibration unit. The images shot by the binocular camera are passed through the video acquisition unit and a bus to the storage unit, implementing the shooting and recording functions; the computer reads the collected image data, identifies roadside objects through the frame-by-frame recording unit, the moving-target calibration unit and the motion analysis unit in turn, constructs three-dimensional coordinate information, judges the motion state of the vehicle carrying the device from the relative motion speed derived from those coordinates, and finally returns the information to the storage unit for motion analysis.
As shown in fig. 3, a method for motion estimation by fusing deep learning feature optical flow and binocular vision includes:
S1, performing contrast-limited adaptive histogram equalization preprocessing on the driving image dataset;
Specifically, driving videos under different speeds and road conditions are collected, and a driving image dataset X = {X[n]_1, X[n]_2, X[n]_3, …, X[n]_fo} is established, where n is the video sequence number and fo is the number of video frames.
The original driving image is scaled to a set resolution (for example 1088×436) and contrast-limited adaptive histogram equalization (CLAHE) is applied; the histogram is then clipped at a preset value to limit the amplification intensity, giving the neighborhood cumulative distribution function mapping:
h(v) = round( (cdf(v) − cdf_min) / (M×N − cdf_min) × (G_i − 1) )
wherein cdf_min is the minimum value of the cumulative distribution function of the pixel values, M×N is the number of driving-image pixels, and G_i is the number of gray levels (256 in this embodiment). This step improves image contrast and enhances color intensity in low-light environments, which facilitates the subsequent analysis.
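As an illustration of this preprocessing step, a minimal Python sketch using OpenCV's CLAHE implementation is given below; the clip limit, tile grid size and the choice of equalizing only the luminance channel are assumptions for the example, not values taken from this description.
```python
import cv2

def preprocess_frame(bgr_frame, size=(1088, 436), clip_limit=2.0, tile_grid=(8, 8)):
    """Scale a driving image and apply contrast-limited adaptive histogram equalization."""
    resized = cv2.resize(bgr_frame, size, interpolation=cv2.INTER_AREA)
    # Equalize only the luminance channel so that colours are not distorted.
    lab = cv2.cvtColor(resized, cv2.COLOR_BGR2LAB)
    l, a, b = cv2.split(lab)
    clahe = cv2.createCLAHE(clipLimit=clip_limit, tileGridSize=tile_grid)
    l_eq = clahe.apply(l)  # equalization with clipped (limited) amplification
    return cv2.cvtColor(cv2.merge((l_eq, a, b)), cv2.COLOR_LAB2BGR)
```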
S2, constructing an optical flow model, and performing recognition training on a moving target object; as shown in fig. 4:
s2.1 design local smoothing assumptions from optical flow characteristics: assume one: the brightness is constant, and the brightness of the image pixel is not changed when the frame moves; suppose two: small motion, the motion between pixel frames is small, namely the relative motion of the image along with the change of time is small; suppose three: spatially coherent, adjacent points on the same surface in the same scene have similar motion. The optical flow equation is derived from the above assumptions:
∂I/∂x·dx + ∂I/∂y·dy + ∂I/∂t·dt = 0, i.e. I(x, y, t) = I(x + Δx, y + Δy, t + Δt).
S2.2, two CNN layers sharing weights are constructed to extract the features of the driving images; the CNN layer architecture is shown in FIG. 5.
Specifically, the CNN comprises six residual layers in total, two at 1/2 resolution, two at 1/4 resolution and two at 1/8 resolution, with the number of channels increasing each time the resolution is halved. When extracting features, two consecutive frames are input, giving a mapping R^(H×W×3) → R^(H×W×D), where D is the number of channels and may be set to 256.
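A minimal PyTorch sketch of such a weight-shared encoder is shown below for illustration: six residual blocks at 1/2, 1/4 and 1/8 resolution with the channel count growing at each halving and a 1×1 projection to D = 256 channels. The intermediate channel widths (64, 96, 128) and the 1/8-resolution output are assumptions for the example rather than values fixed by this description.
```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, c_in, c_out, stride=1):
        super().__init__()
        self.conv1 = nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1)
        self.conv2 = nn.Conv2d(c_out, c_out, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)
        self.skip = (nn.Conv2d(c_in, c_out, 1, stride=stride)
                     if stride != 1 or c_in != c_out else nn.Identity())

    def forward(self, x):
        y = self.relu(self.conv1(x))
        y = self.conv2(y)
        return self.relu(y + self.skip(x))

class FeatureEncoder(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.layers = nn.Sequential(
            ResidualBlock(3, 64, stride=2),   ResidualBlock(64, 64),     # 1/2 resolution
            ResidualBlock(64, 96, stride=2),  ResidualBlock(96, 96),     # 1/4 resolution
            ResidualBlock(96, 128, stride=2), ResidualBlock(128, 128),   # 1/8 resolution
        )
        self.head = nn.Conv2d(128, dim, 1)  # project to D channels

    def forward(self, img):                  # img: (B, 3, H, W)
        return self.head(self.layers(img))   # features: (B, D, H/8, W/8)

# Weight sharing: the same encoder instance is applied to both frames.
encoder = FeatureEncoder()
frame1, frame2 = torch.rand(1, 3, 384, 512), torch.rand(1, 3, 384, 512)
f1, f2 = encoder(frame1), encoder(frame2)
```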
S2.3, carrying out inner product calculation on the feature pairs of the two driving images;
in particular, characteristic f 1 ∈R H×W×D And characteristic f 2 ∈R H×W×D Respectively representing driving images I 1 And I 2 The visual similarity is obtained by the pairwise inner product of the feature vectors, and is expressed as:
C(f_1, f_2)_(ijkl) = Σ_d f_1(i, j, d)·f_2(k, l, d),
wherein C(f_1, f_2) ∈ R^(H×W×H×W); ij and kl are the positions of the optical flow point in the first and second frames respectively; d is the image channel index with range [0, D−1]; C is the four-dimensional vector feature; H and W are the image resolution and D is the number of channels.
According to the optical flow equation and the brightness-constancy assumption of step S2.1, the inter-image motion state is obtained by least squares: the flow [u, v]^T is the least-squares solution of the system
I_x(q_i)·u + I_y(q_i)·v = −I_t(q_i), i = 1, 2, 3, …, n,
wherein I_x(q_i), I_y(q_i), I_t(q_i) are the image derivatives at the pixels q_i in the neighborhood of the optical flow position.
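The pairwise inner product above can be computed in one einsum call; the sketch below (PyTorch, with an assumed 1/sqrt(D) normalization that is not part of this description) illustrates the construction of the four-dimensional correlation volume from the two feature maps.
```python
import torch

def correlation_volume(f1: torch.Tensor, f2: torch.Tensor) -> torch.Tensor:
    """f1, f2: feature maps of shape (B, D, H, W); returns C of shape (B, H, W, H, W)."""
    b, d, h, w = f1.shape
    # C[b, i, j, k, l] = sum_d f1[b, d, i, j] * f2[b, d, k, l]
    corr = torch.einsum('bdij,bdkl->bijkl', f1, f2)
    return corr / d ** 0.5  # assumed normalization by sqrt(D)
```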
s2.4, constructing a pyramid to perform pooling operation on the four-dimensional vector characteristics;
specifically, three layers of relevant pyramids are constructed, 1,2 and 4 kernels are respectively arranged from one layer to three layers, and pooling processing is performed on two dimensions behind the four-dimensional vector obtained in the step 2.3, so that motion change is more obvious. The road image pyramid is represented as follows:
I^L(x, y) = 1/4·I^(L−1)(2x, 2y) + 1/8·[ I^(L−1)(2x−1, 2y) + I^(L−1)(2x+1, 2y) + I^(L−1)(2x, 2y−1) + I^(L−1)(2x, 2y+1) ] + 1/16·[ I^(L−1)(2x−1, 2y−1) + I^(L−1)(2x+1, 2y−1) + I^(L−1)(2x−1, 2y+1) + I^(L−1)(2x+1, 2y+1) ],
wherein L is the pyramid level and I^L is the image of level L.
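For illustration, the pooling of the last two dimensions of the four-dimensional volume with kernels 1, 2 and 4 can be sketched as follows; treating the first two dimensions as a batch axis is an implementation choice assumed for the example.
```python
import torch
import torch.nn.functional as F

def correlation_pyramid(corr: torch.Tensor) -> list:
    """corr: (B, H, W, H, W) correlation volume; returns three pooled levels."""
    b, h1, w1, h2, w2 = corr.shape
    flat = corr.reshape(b * h1 * w1, 1, h2, w2)   # pool only the last two dimensions
    levels = []
    for kernel in (1, 2, 4):
        pooled = flat if kernel == 1 else F.avg_pool2d(flat, kernel, stride=kernel)
        levels.append(pooled.reshape(b, h1, w1, *pooled.shape[-2:]))
    return levels
```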
S2.5, acquiring four-dimensional vector characteristics of the high-resolution driving image;
specifically, note the optical flow corresponding point x' = (u + f) between two frames 1 (u),v+f 2 (v) U is the pixel abscissa, v is the pixel ordinate, f) 1 As an optical flow feature of the first frame image, f 2 For the optical flow feature of the second frame image, the neighborhood grid is
N(x')_r = { x' + dx | dx ∈ Z², ||dx||_1 ≤ r },
m being the number of pyramid levels; then, by looking up
N(x'/2^k)_r
the position corresponding to the optical flow is found on each level, wherein k may be any real number (non-integer positions are evaluated by bilinear interpolation); according to this correspondence, the four-dimensional vector feature of the high-resolution driving image is expressed as C^m(f_1, f_2)(i, j, p, q),
wherein m is the m-th level of the pyramid, and p and q are the row and column of the optical flow point in the pixel matrix of the m-th level.
S2.6, the CNN layer iteratively updates the driving image data;
Specifically, the CNN layer not only performs optical flow estimation to generate image features but also carries out the data iteration. An update iteration is a sequence of classical gated recurrent units over the previous data, trained through the shared-weight convolution layers. The default initial value is 0; given the current optical flow state f_k, each iteration generates a residual optical flow relative to the output of the previous iteration, i.e. an update value Δf, so the next optical flow prediction is f_{k+1} = f_k + Δf. The gates of the gated recurrent unit are:
Z_t = σ(W_z·[H_{t−1}, X_t])
R_t = σ(W_r·[H_{t−1}, X_t])
H_cand = tanh(W·[R_t ⊙ H_{t−1}, X_t])
H_t = (1 − Z_t) ⊙ H_{t−1} + Z_t ⊙ H_cand
wherein R_t is the reset gate, Z_t is the update gate, σ is the activation function, H_t is the amount of information retained from the hidden state of the previous stage, H_{t−1} is the hidden layer, X_t is the optical flow input value, W_r and W_z are weight information matrices, and W is the candidate-state weight matrix.
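A minimal convolutional form of this gated recurrent update is sketched below; the kernel size, hidden width and input width are assumptions for the example. In use, a small convolutional head maps the hidden state H_t to the residual flow Δf that is added to f_k.
```python
import torch
import torch.nn as nn

class ConvGRU(nn.Module):
    def __init__(self, hidden_dim=128, input_dim=192):
        super().__init__()
        self.convz = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convr = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)
        self.convq = nn.Conv2d(hidden_dim + input_dim, hidden_dim, 3, padding=1)

    def forward(self, h, x):
        hx = torch.cat([h, x], dim=1)
        z = torch.sigmoid(self.convz(hx))                         # update gate Z_t
        r = torch.sigmoid(self.convr(hx))                         # reset gate R_t
        q = torch.tanh(self.convq(torch.cat([r * h, x], dim=1)))  # candidate state
        return (1 - z) * h + z * q                                # new hidden state H_t
```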
S2.7, after four-dimensional vector characteristics of a high-resolution driving image (moving object image after motion enhancement) are obtained, pixel information of an optical flow is obtained under the original resolution tracking of a pyramid, and then a motion target activity area is obtained, wherein the minimum displacement of the image in the area is v = [ v ] = [ v = x ,v y ] T The matching error and minimum value ε (v) in each point neighborhood range is:
ε(v) = Σ_{x = p_x − w_x}^{p_x + w_x} Σ_{y = p_y − w_y}^{p_y + w_y} ( A(x, y) − B(x + v_x, y + v_y) )²
wherein v = [v_x, v_y]^T, with v_x and v_y the horizontal and vertical displacements at the top pyramid level; p_x is the abscissa of the optical flow point and w_x the abscissa neighborhood range; p_y is the ordinate of the optical flow point and w_y the ordinate neighborhood range; A(x, y) is the first-frame optical flow feature and B(x, y) is the second-frame optical flow feature;
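This neighborhood matching error is what a pyramidal Lucas-Kanade tracker minimizes, so the tracking step can be illustrated with OpenCV; the window size, pyramid depth and termination criteria below are assumed values.
```python
import cv2
import numpy as np

def track_points(prev_gray, next_gray, points):
    """points: (N, 1, 2) float32 pixel coordinates in the previous frame."""
    next_pts, status, _err = cv2.calcOpticalFlowPyrLK(
        prev_gray, next_gray, points, None,
        winSize=(21, 21),   # (2*w_x + 1, 2*w_y + 1) search window
        maxLevel=3,         # number of pyramid levels
        criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 30, 0.01))
    good = status.reshape(-1) == 1
    return points[good], next_pts[good]   # matched point pairs A -> B
```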
s2.8, performing recognition training on the moving target object in the moving area; ( Such as: common roadside facilities such as street trees, signboards, buildings and the like )
And selecting a supervision algorithm for the model, wherein the loss function is set as follows:
L = Σ_{i=1}^{N} γ^(N−i)·|| f_gt − f_i ||_1
i.e. the L1 norm between each iteration result f_i and the ground-truth optical flow f_gt, where N is the number of iterations and γ = 0.8.
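A sketch of this exponentially weighted L1 loss over the N iterative flow estimates is given below; averaging the per-pixel error is an assumed detail.
```python
import torch

def sequence_loss(flow_preds, flow_gt, gamma=0.8):
    """flow_preds: list of N tensors (B, 2, H, W); flow_gt: (B, 2, H, W)."""
    n = len(flow_preds)
    loss = 0.0
    for i, f_i in enumerate(flow_preds):
        weight = gamma ** (n - i - 1)            # later iterations are weighted more
        loss = loss + weight * (flow_gt - f_i).abs().mean()
    return loss
```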
S3, ranging through a binocular camera to obtain the position of a target object;
s3.1, obtaining distortion description according to parameters and distortion coefficients of the left camera and the right camera;
in particular, barrel distortion is caused by the configuration of the lens, and tangential distortion is caused by the inability of the lens and the imaging plane to be perfectly parallel during the camera assembly process. Therefore, distortion vectors are obtained through the internal parameters, the external parameters and the distortion coefficients of the left camera and the right camera; for correction in a subsequent step. The distortion disparity (g, h) is described as:
the standard radial, tangential and thin-prism lens distortion model,
wherein s_1, s_2, s_3 are the thin-prism distortion coefficients, k_1, k_2, k_3, k_4, k_5, k_6 are the radial distortion coefficients, p_1, p_2 are the tangential distortion coefficients, and r is the distortion radius.
S3.2, projecting and changing the left image and the right image which are shot in the same scene at the same time, namely acquiring a correction mapping table, and remapping by using the correction mapping table;
specifically, due to parallax, images obtained by the left lens and the right lens of the binocular camera cannot be completely overlapped, so that two views which are shot in the same scene at the same time are subjected to projection change: the two image planes are parallel to the baseline and the same target point is on the same horizontal line in both the left and right images, i.e., co-planar rows are aligned, as shown in fig. 7.
A point P in space is projected through the left and right camera lenses C_1, C_2 by the projection matrix equations
Z_c1·[u_1, v_1, 1]^T = M^(1)·[X, Y, Z, 1]^T
Z_c2·[u_2, v_2, 1]^T = M^(2)·[X, Y, Z, 1]^T
wherein (u_1, v_1, 1) and (u_2, v_2, 1) are the homogeneous coordinates of x_1 and x_2 in the respective images; (X, Y, Z, 1) is the homogeneous coordinate of the point P in world coordinates; and m_ij is the element in row i and column j of the projection matrix M.
When the point is extended to a line element, let c_1 and c_2 be the images in the left and right cameras of the same straight line S in space; the parametric equations of S in the space coordinate system are written and substituted into the projection matrices, giving the mapping equations of the left and right cameras in the projection plane, from which the rectification (correction) mapping table is obtained.
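Steps S3.1 and S3.2 can be illustrated with OpenCV's stereo rectification routines; the calibration values (K1, D1, K2, D2, R, T) are assumed to come from a prior stereo calibration and are not defined in this description.
```python
import cv2

def rectify_pair(img_l, img_r, K1, D1, K2, D2, R, T):
    """Remap the left/right views so that corresponding rows are aligned."""
    size = (img_l.shape[1], img_l.shape[0])                      # (width, height)
    R1, R2, P1, P2, Q, _roi1, _roi2 = cv2.stereoRectify(K1, D1, K2, D2, size, R, T)
    map1x, map1y = cv2.initUndistortRectifyMap(K1, D1, R1, P1, size, cv2.CV_32FC1)
    map2x, map2y = cv2.initUndistortRectifyMap(K2, D2, R2, P2, size, cv2.CV_32FC1)
    rect_l = cv2.remap(img_l, map1x, map1y, cv2.INTER_LINEAR)
    rect_r = cv2.remap(img_r, map2x, map2y, cv2.INTER_LINEAR)
    return rect_l, rect_r, Q        # Q is the perspective transformation matrix used in S3.5
```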
s3.3 finding corresponding points in the left image and the right image through the SGM to obtain the visual difference disparity = u l -u r Wherein u is l 、u r And respectively generating a disparity map for the column coordinates of the target corresponding point in the left image and the right image, and then performing filtering processing by using a Guided Filter to reduce noise.
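For illustration, semi-global matching followed by guided filtering can be sketched as below; the disparity range, block size and filter parameters are assumed values, and cv2.ximgproc requires the opencv-contrib-python package.
```python
import cv2
import numpy as np

def compute_disparity(rect_l_gray, rect_r_gray, num_disp=128, block=5):
    sgbm = cv2.StereoSGBM_create(
        minDisparity=0, numDisparities=num_disp, blockSize=block,
        P1=8 * block * block, P2=32 * block * block,
        uniquenessRatio=10, speckleWindowSize=100, speckleRange=2)
    disp = sgbm.compute(rect_l_gray, rect_r_gray).astype(np.float32) / 16.0  # SGBM output is fixed-point
    disp = cv2.ximgproc.guidedFilter(guide=rect_l_gray, src=disp, radius=9, eps=1e-2)
    return disp   # disparity = u_l - u_r, in pixels
```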
S3.4 calculating the depth of the target object
depth = f·b / (d − (c_xl − c_xr))
where f is the focal length, b is the baseline length, d is the disparity, and c_xl, c_xr are the column coordinates of the two cameras' principal points;
s3.5, constructing a 3D space according to the group of parallax images to obtain three-dimensional coordinates of pixel points:
[X Y Z W]^T = Q·[g h disparity(g, h) 1]^T
3DImage(g, h) = (X/W, Y/W, Z/W)
wherein X, Y, Z and W are the four-dimensional homogeneous coordinates obtained after the matrix transformation; g, h are the position of each pixel; Q is the perspective transformation matrix; and 3DImage(g, h) gives the (x, y, z) coordinates in the viewing coordinate system. A three-dimensional matrix is finally obtained in which the X, Y and Z coordinates are recorded (the coordinate system is established with the left camera as reference).
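The reprojection with the Q matrix corresponds to OpenCV's reprojectImageTo3D; a short sketch follows, with the validity mask being an assumed convenience.
```python
import cv2
import numpy as np

def reproject(disp, Q):
    points_3d = cv2.reprojectImageTo3D(disp, Q)   # (H, W, 3): X, Y, Z per pixel, left-camera frame
    depth_z = points_3d[:, :, 2]                  # Z channel used for the speed estimate in S4
    valid = np.isfinite(depth_z) & (disp > 0)
    return points_3d, depth_z, valid
```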
S4, acquiring the movement speed of the vehicle body;
specifically, the Z-axis coordinate value obtained in step 3.5 is extracted from two adjacent frames, and the vehicle body movement speed can be obtained as the camera shooting mode is 25fps and the relative movement principle
v = |d_k − d_{k−1}| × 25
wherein d_k and d_{k−1} are the Z-axis values of the three-dimensional coordinates of the corresponding pixel point in the current frame and the previous frame, and 1/25 s is the inter-frame interval at 25 fps.
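A one-line sketch of this speed estimate, using the 25 fps frame rate stated above, is given for completeness; depth values are assumed to be in metres.
```python
FPS = 25.0

def body_speed(z_prev, z_curr, fps=FPS):
    """z_prev, z_curr: Z-axis depth of the same tracked static point in consecutive frames."""
    return abs(z_prev - z_curr) * fps   # metres per second when depth is in metres
```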
The foregoing descriptions of specific exemplary embodiments of the present invention have been presented for purposes of illustration and description. It is not intended to limit the invention to the precise form disclosed, and obviously many modifications and variations are possible in light of the above teaching. The exemplary embodiments were chosen and described in order to explain certain principles of the invention and its practical application to enable one skilled in the art to make and use various exemplary embodiments of the invention and various alternatives and modifications as are suited to the particular use contemplated. It is intended that the scope of the invention be defined by the claims and their equivalents.

Claims (9)

1. A motion estimation method fusing deep learning feature optical flow and binocular vision, characterized by comprising:
performing contrast-limited adaptive histogram equalization preprocessing on a driving image dataset;
constructing an optical flow feature extraction model based on deep learning and training it to recognize moving targets;
ranging with a binocular camera to obtain the position of the target object; and
obtaining the motion speed of the vehicle body.
2. The motion estimation method fusing deep learning feature optical flow and binocular vision according to claim 1, wherein the driving image dataset is subjected to contrast-limited adaptive histogram equalization preprocessing, specifically: the original driving image is scaled to a set resolution, contrast-limited adaptive histogram equalization is applied, and the histogram is clipped at a preset value to limit the amplification intensity, giving the neighborhood cumulative distribution function mapping:
h(v) = round( (cdf(v) − cdf_min) / (M×N − cdf_min) × (G_i − 1) )
wherein cdf_min is the minimum value of the cumulative distribution function of the pixel values, M×N is the number of driving-image pixels, and G_i is the number of gray levels.
3. The motion estimation method fusing deep learning feature optical flow and binocular vision according to claim 1, wherein an optical flow feature extraction model based on deep learning is constructed and trained to recognize the moving target, specifically:
designing local smoothness assumptions according to the optical flow characteristics to obtain the optical flow equation:
∂I/∂x·dx + ∂I/∂y·dy + ∂I/∂t·dt = 0, i.e. I(x, y, t) = I(x + Δx, y + Δy, t + Δt),
wherein x is the pixel abscissa, y is the pixel ordinate and t is time; dx, dy and dt are the differentials of x, y and t; I is the optical flow image information; ∂ is the partial differential operator; and Δx, Δy, Δt are the changes in x, y and t;
constructing two weight-sharing CNN layers to extract the features of the driving images;
computing the inner products of the feature pairs of the two driving images: the features f_1 ∈ R^(H×W×D) and f_2 ∈ R^(H×W×D) represent the driving images I_1 and I_2 respectively, and the visual similarity is obtained as the pairwise inner product of the feature vectors, expressed as:
C(f_1, f_2)_(ijkl) = Σ_d f_1(i, j, d)·f_2(k, l, d),
wherein C(f_1, f_2) ∈ R^(H×W×H×W); ij and kl are the positions of the optical flow point in the first and second frames respectively; d is the image channel index with range [0, D−1]; C is the four-dimensional vector feature; H and W are the image resolution and D is the number of channels;
constructing a pyramid to perform a pooling operation on the four-dimensional vector features;
obtaining the four-dimensional vector features of the high-resolution driving image: because of the data cost between pyramid levels, the optical flow correspondence point between the two frames is recorded as x' = (u + f_1(u), v + f_2(v)), where u is the pixel abscissa, v is the pixel ordinate, f_1 is the optical flow feature of the first frame image and f_2 is the optical flow feature of the second frame image; the neighborhood grid is
N(x')_r = { x' + dx | dx ∈ Z², ||dx||_1 ≤ r },
m being the number of pyramid levels; then, by looking up
N(x'/2^k)_r
the position corresponding to the optical flow is found on each level, wherein k may be any real number (non-integer positions are evaluated by bilinear interpolation); according to this correspondence, the four-dimensional vector feature of the high-resolution driving image is expressed as C^m(f_1, f_2)(i, j, p, q),
wherein m is the m-th level of the pyramid, and p and q are the row and column of the optical flow point in the pixel matrix of the m-th level;
the CNN layer iteratively updates the driving image data: given the current optical flow state f_k, each iteration generates a residual optical flow relative to the output of the previous iteration, i.e. an update value Δf, so the next optical flow prediction is f_{k+1} = f_k + Δf; the update is
Z_t = σ(W_z·[H_{t−1}, X_t])
R_t = σ(W_r·[H_{t−1}, X_t])
H_cand = tanh(W·[R_t ⊙ H_{t−1}, X_t])
H_t = (1 − Z_t) ⊙ H_{t−1} + Z_t ⊙ H_cand
wherein R_t is the reset gate, Z_t is the update gate, σ is the activation function, H_t is the amount of information retained from the hidden state of the previous stage, H_{t−1} is the hidden layer, X_t is the optical flow input value, W_r and W_z are weight information matrices, and W is the candidate-state weight matrix;
after the four-dimensional vector features of the high-resolution driving image are obtained, the pixel information of the optical flow is tracked at the original resolution of the pyramid to obtain the moving target's activity area, in which the minimum image displacement is v = [v_x, v_y]^T; the matching error ε(v) over the neighborhood of each point is
ε(v) = Σ_{x = p_x − w_x}^{p_x + w_x} Σ_{y = p_y − w_y}^{p_y + w_y} ( A(x, y) − B(x + v_x, y + v_y) )²,
wherein v_x and v_y are the horizontal and vertical displacements at the top pyramid level; p_x is the abscissa of the optical flow point and w_x the abscissa neighborhood range; p_y is the ordinate of the optical flow point and w_y the ordinate neighborhood range; A(x, y) is the first-frame optical flow feature and B(x, y) is the second-frame optical flow feature;
carrying out recognition training on the moving target object in the activity area;
a supervised algorithm is selected for the model, with the loss function set as
L = Σ_{i=1}^{N} γ^(N−i)·|| f_gt − f_i ||_1,
i.e. the L1 norm between each iteration result and the true value, where N is the number of iterations and γ = 0.8; f_gt is the ground-truth optical flow feature, f_i is the optical flow feature of the i-th iteration; Δx_gt and Δy_gt are the horizontal and vertical displacements of the ground-truth flow, and Δx_i and Δy_i are the horizontal and vertical displacement coordinates of the estimated flow.
4. The motion estimation method fusing deep learning feature optical flow and binocular vision according to claim 3, wherein the CNN comprises six residual layers in total, two at 1/2 resolution, two at 1/4 resolution and two at 1/8 resolution, with the number of channels increasing each time the resolution between residual layers is halved; when extracting features, two consecutive frames are input, giving a mapping R^(H×W×3) → R^(H×W×D), wherein H and W are the image resolution and D is the number of channels.
5. The motion estimation method fusing deep learning feature optical flow and binocular vision according to claim 3, wherein a pyramid is constructed to perform the pooling operation on the four-dimensional vector features, specifically:
three levels of similarity pyramids of the driving image are constructed, with kernels 1, 2 and 4 from level one to level three, and the last two dimensions of the four-dimensional vector features are pooled; the driving image pyramid is expressed as
I^L(x, y) = 1/4·I^(L−1)(2x, 2y) + 1/8·[ I^(L−1)(2x−1, 2y) + I^(L−1)(2x+1, 2y) + I^(L−1)(2x, 2y−1) + I^(L−1)(2x, 2y+1) ] + 1/16·[ I^(L−1)(2x−1, 2y−1) + I^(L−1)(2x+1, 2y−1) + I^(L−1)(2x−1, 2y+1) + I^(L−1)(2x+1, 2y+1) ],
wherein L is the pyramid level, I^L is the image of level L, and x, y are the pixel coordinates of the optical flow point.
6. The motion estimation method fusing deep learning feature optical flow and binocular vision according to claim 1, wherein ranging is performed with the binocular camera to obtain the target position, specifically:
obtaining the distortion description from the parameters and distortion coefficients of the left and right cameras;
applying a projective transformation to the left and right images shot of the same scene at the same time, i.e. computing a rectification mapping table and remapping with it;
finding corresponding points in the left and right images by semi-global matching (SGM) in stereo matching to obtain the disparity, disparity = u_l − u_r, wherein u_l and u_r are the column coordinates of the corresponding target point in the left and right images, generating a disparity map, which is then filtered with a Guided Filter;
calculating the depth of the target:
depth = f·b / (d − (c_xl − c_xr)),
wherein f is the focal length, b is the baseline length, d is the disparity, and c_xl, c_xr are the column coordinates of the two cameras' principal points;
constructing a 3D space from the set of disparity maps to obtain the three-dimensional coordinates of the pixel points:
[X Y Z W]^T = Q·[g h disparity(g, h) 1]^T
3DImage(g, h) = (X/W, Y/W, Z/W)
wherein X, Y, Z and W are the four-dimensional homogeneous coordinates obtained after the matrix transformation; g, h are the position of each pixel; Q is the perspective transformation matrix; and 3DImage(g, h) gives the (x, y, z) coordinates in the viewing coordinate system.
7. The motion estimation method fusing deep learning feature optical flow and binocular vision according to claim 6, wherein the distortion disparity(g, h) is described by the standard radial, tangential and thin-prism lens distortion model, wherein s_1, s_2, s_3 are the thin-prism distortion coefficients, k_1, k_2, k_3, k_4, k_5, k_6 are the radial distortion coefficients, p_1, p_2 are the tangential distortion coefficients, and r is the distortion radius.
8. The motion estimation method fusing deep learning feature optical flow and binocular vision according to claim 6, wherein the two images shot of the same scene at the same time are projectively transformed, specifically:
the two image planes are made parallel to the baseline, and the same target point lies on the same horizontal line in the left and right images, i.e. coplanar rows are aligned;
a point P in space is projected through the left and right camera lenses C_1, C_2 by the projection matrix equations
Z_c1·[u_1, v_1, 1]^T = M^(1)·[X, Y, Z, 1]^T
Z_c2·[u_2, v_2, 1]^T = M^(2)·[X, Y, Z, 1]^T
wherein (u_1, v_1, 1) and (u_2, v_2, 1) are the homogeneous coordinates of x_1 and x_2 in the respective images; (X, Y, Z, 1) is the homogeneous coordinate of the point P in world coordinates; and m_ij is the element in row i and column j of the projection matrix M;
when the point is extended to a line element, let c_1 and c_2 be the images in the left and right cameras of the same straight line S in space; the parametric equations of S in the space coordinate system are written and substituted into the projection matrices, giving the mapping equations of the left and right cameras in the projection plane.
9. The motion estimation method fusing deep learning feature optical flow and binocular vision according to claim 1, wherein the vehicle body motion speed is obtained, specifically:
v = |d_k − d_{k−1}| / Δt,
wherein d_k and d_{k−1} are the Z-axis values of the three-dimensional coordinates of the corresponding pixel point in the current frame and the previous frame, and Δt is the time between the two frames.
CN202211149943.1A 2022-09-21 2022-09-21 Motion estimation method integrating deep learning characteristic optical flow and binocular vision Pending CN115482257A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211149943.1A CN115482257A (en) 2022-09-21 2022-09-21 Motion estimation method integrating deep learning characteristic optical flow and binocular vision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211149943.1A CN115482257A (en) 2022-09-21 2022-09-21 Motion estimation method integrating deep learning characteristic optical flow and binocular vision

Publications (1)

Publication Number Publication Date
CN115482257A true CN115482257A (en) 2022-12-16

Family

ID=84423546

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211149943.1A Pending CN115482257A (en) 2022-09-21 2022-09-21 Motion estimation method integrating deep learning characteristic optical flow and binocular vision

Country Status (1)

Country Link
CN (1) CN115482257A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117705064A (en) * 2023-12-15 2024-03-15 河南理工大学 Vehicle running state judging method based on visual assistance in urban canyon
CN117705064B (en) * 2023-12-15 2024-09-17 河南理工大学 Vehicle running state judging method based on visual assistance in urban canyon

Similar Documents

Publication Publication Date Title
CN112435325B (en) VI-SLAM and depth estimation network-based unmanned aerial vehicle scene density reconstruction method
CN110569704B (en) Multi-strategy self-adaptive lane line detection method based on stereoscopic vision
CN110675418B (en) Target track optimization method based on DS evidence theory
CN110689562A (en) Trajectory loop detection optimization method based on generation of countermeasure network
Vaudrey et al. Differences between stereo and motion behaviour on synthetic and real-world stereo sequences
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
CN113506318B (en) Three-dimensional target perception method under vehicle-mounted edge scene
CN111797684B (en) Binocular vision ranging method for moving vehicle
Qian et al. Robust visual-lidar simultaneous localization and mapping system for UAV
CN112465021A (en) Pose track estimation method based on image frame interpolation method
CN110706253B (en) Target tracking method, system and device based on apparent feature and depth feature
CN115482257A (en) Motion estimation method integrating deep learning characteristic optical flow and binocular vision
CN116188550A (en) Self-supervision depth vision odometer based on geometric constraint
CN112432653A (en) Monocular vision inertial odometer method based on point-line characteristics
CN109740405B (en) Method for detecting front window difference information of non-aligned similar vehicles
CN113706599B (en) Binocular depth estimation method based on pseudo label fusion
CN114708321B (en) Semantic-based camera pose estimation method and system
CN116129318A (en) Unsupervised monocular three-dimensional target detection method based on video sequence and pre-training instance segmentation
CN116151320A (en) Visual odometer method and device for resisting dynamic target interference
CN115482282A (en) Dynamic SLAM method with multi-target tracking capability in automatic driving scene
CN115496788A (en) Deep completion method using airspace propagation post-processing module
CN111833384B (en) Method and device for rapidly registering visible light and infrared images
CN103236053A (en) MOF (motion of focus) method for detecting moving objects below mobile platform
Huang et al. Single target tracking in high-resolution satellite videos: a comprehensive review
CN115994934B (en) Data time alignment method and device and domain controller

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination