CN112233149A - Scene flow determination method and device, storage medium and electronic device - Google Patents

Scene flow determination method and device, storage medium and electronic device

Info

Publication number
CN112233149A
CN112233149A (application CN202011174348.4A)
Authority
CN
China
Prior art keywords
energy function
determining
binocular image
scene flow
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011174348.4A
Other languages
Chinese (zh)
Inventor
崔婵婕
任宇鹏
卢维
王晓鲁
黄积晟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Dahua Technology Co Ltd
Original Assignee
Zhejiang Dahua Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Dahua Technology Co Ltd filed Critical Zhejiang Dahua Technology Co Ltd
Priority to CN202011174348.4A
Publication of CN112233149A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G06T7/215 Motion-based segmentation
    • G06T7/50 Depth or shape recovery
    • G06T7/55 Depth or shape recovery from multiple images
    • G06T7/579 Depth or shape recovery from multiple images from motion
    • G06T7/593 Depth or shape recovery from multiple images from stereo images
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10016 Video; Image sequence
    • G06T2207/10021 Stereoscopic video; Stereoscopic image sequence
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Image Processing (AREA)

Abstract

Embodiments of the invention provide a scene flow determination method and device, a storage medium and an electronic device, wherein the method includes: determining processing parameters of pixel points in an acquired binocular image, wherein the processing parameters include an instance segmentation result, an optical flow result and a disparity result of the binocular image; determining a scene flow energy function of the binocular image based on the processing parameters; optimizing the scene flow energy function to obtain a target scene flow energy function; and determining the scene flow of the pixel points in the binocular image based on the target scene flow energy function. The method and device solve the problem in the related art that energy functions are solved with insufficient speed and accuracy, and achieve the effect of improving the speed and accuracy of solving complex energy functions.

Description

Scene flow determination method and device, storage medium and electronic device
Technical Field
Embodiments of the invention relate to the field of image processing, and in particular to a scene flow determination method and device, a storage medium and an electronic device.
Background
Scene flow estimation aims at estimating, from two consecutive binocular images, a three-dimensional motion vector for each point in the image, i.e. the change of each point between the previous and the next frame. Accurate scene flow estimation is of great importance to robotics, where it is used to enable autonomous navigation and manipulation in dynamic environments. Scene flow estimation not only allows a machine to localize itself accurately, but also improves its ability to perceive and anticipate the motion of surrounding objects, so that collisions can be avoided during path planning. At present there are two classes of methods for computing scene flow: methods based on binocular stereo vision and methods based on RGB-D (red, green, blue plus depth) images. RGB-D based methods use a depth sensor to acquire comparatively accurate depth information directly, which improves accuracy and saves time; however, they are prone to errors, or even failures, at occlusions. Binocular stereo matching based methods estimate disparity and depth information from the left and right images captured by a binocular camera, together with the optical flow between consecutive frames, and compute the scene flow from the depth and optical flow information. These methods spend considerable extra time because the disparity map must be estimated, and the process of solving the scene flow from the optical flow and disparity information is itself time-consuming, making real-time prediction difficult. In addition, although deep learning now gradually outperforms traditional algorithms in many fields, it depends heavily on accurately annotated data; a scene flow result describes the motion state of every pixel and cannot be annotated manually, which limits the use of deep learning methods for this problem.
For the problem in the prior art that energy functions are solved with insufficient speed and accuracy, no effective solution has yet been proposed in the related art.
Disclosure of Invention
Embodiments of the invention provide a scene flow determination method and device, a storage medium and an electronic device, so as to at least solve the problem in the related art that energy functions are solved with insufficient speed and accuracy.
According to an embodiment of the present invention, a scene flow determination method is provided, including: determining processing parameters of pixel points in an acquired binocular image, wherein the processing parameters include an instance segmentation result, an optical flow result and a disparity result of the binocular image; determining a scene flow energy function of the binocular image based on the processing parameters; optimizing the scene flow energy function to obtain a target scene flow energy function; and determining the scene flow of the pixel points in the binocular image based on the target scene flow energy function.
According to another embodiment of the present invention, a scene flow determination apparatus is provided, including: a first determining module, configured to determine processing parameters of pixel points in an acquired binocular image, wherein the processing parameters include an instance segmentation result, an optical flow result and a disparity result of the binocular image; a second determining module, configured to determine a scene flow energy function of the binocular image based on the processing parameters; a third determining module, configured to optimize the scene flow energy function to obtain a target scene flow energy function; and a fourth determining module, configured to determine the scene flow of the pixel points in the binocular image based on the target scene flow energy function.
In an exemplary embodiment, the first determining module includes: a first acquisition unit, configured to acquire the binocular image using a binocular camera device, wherein the binocular image includes an Nth-frame left image and an Nth-frame right image, and N is a natural number greater than or equal to 1; a first determining unit, configured to perform instance segmentation on the binocular image through a preset instance segmentation network to obtain the instance segmentation result; a second determining unit, configured to determine, through a preset optical flow estimation network, the displacement of pixel points in the binocular image in a preset coordinate system to obtain the optical flow result; a third determining unit, configured to determine, through a preset stereo matching network, the coordinate offset of pixel points in the binocular image in the preset coordinate system to obtain the disparity result; and a fourth determining unit, configured to determine the instance segmentation result, the optical flow result and the disparity result as the processing parameters.
In an exemplary embodiment, the second determining module includes: a fifth determining unit, configured to determine a photometric error constraint term, a rigid fitting constraint term and an optical flow consistency constraint term of the binocular image using the instance segmentation result, the optical flow result, the disparity result and a rotation-translation matrix, wherein the rotation-translation matrix is determined based on the rotation and translation of a target object in the binocular image between adjacent frame images; and a sixth determining unit, configured to determine the scene flow energy function of the binocular image using the photometric error constraint term, the rigid fitting constraint term and the optical flow consistency constraint term.
In an exemplary embodiment, the sixth determining unit includes: a first determining subunit, configured to determine E_i = E_photo,i + E_rigid,i + E_flow,i, wherein E_i represents the scene flow energy function, E_photo,i represents the photometric error constraint term, E_rigid,i represents the rigid fitting constraint term, E_flow,i represents the optical flow consistency constraint term, and i is an integer greater than or equal to 0.
In an exemplary embodiment, the fifth determining unit includes: a second determining subunit, configured to determine the photometric error constraint term E_photo,i, the rigid fitting constraint term E_rigid,i and the optical flow consistency constraint term E_flow,i according to formulas 2 to 4 given in the detailed description, wherein i is an integer greater than or equal to 0, α_p is an indicator function indicating whether the current pixel in the binocular image is an outlier, p represents a pixel point in the current instance P_i of the binocular image, P_i represents the i-th instance in the instance set S_L0 of the left image L_0 of the binocular image, p' represents the pixel point obtained by re-projecting the three-dimensional point cloud of p in the Nth frame into the (N+1)th frame through the internal and external parameters of the camera device, and RT represents the rotation-translation matrix.
In one exemplary embodiment, p' is determined by p' = π_K(RT · π_K^{-1}(p, D_0(p))), wherein π_K represents projecting the three-dimensional point cloud onto the imaging plane of the camera device using the camera intrinsic parameters to obtain a two-dimensional image; π_K^{-1} represents reconstructing the three-dimensional point cloud from the two-dimensional coordinates and the disparity map through the stereo calibration parameters of the camera device; and q represents the pixel point obtained by adding to the pixel point p in the binocular image the optical flow value corresponding to p.
In an exemplary embodiment, the third determining module includes: a seventh determining unit, configured to optimize the scene flow energy function using a long short-term memory (LSTM) network to obtain the target scene flow energy function, wherein the LSTM network includes hidden LSTM layers and a linear layer.
In an exemplary embodiment, the seventh determining unit includes: a third determining subunit, configured to determine a training number and an expansion number, wherein the training number indicates the number of times the LSTM network is updated, and the expansion number indicates the number of times the rotation-translation matrix is updated; a fourth determining subunit, configured to initialize the LSTM network to obtain an initial LSTM network; and a fifth determining subunit, configured to, when the number of LSTM network updates is smaller than the training number and the number of rotation-translation matrix updates is smaller than the expansion number, input the read sampled processing parameters and the initial rotation-translation matrix into the initial LSTM network to train the initial LSTM network and obtain the target scene flow energy function.
In an exemplary embodiment, the apparatus further includes: a testing module, configured to test the target scene flow energy function using the processing parameters of the pixel points in the binocular image and the initial rotation-translation matrix RT after the scene flow energy function has been optimized to obtain the target scene flow energy function.
According to a further embodiment of the present invention, there is also provided a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
According to the invention, processing parameters of pixel points in an acquired binocular image are determined, wherein the processing parameters include an instance segmentation result, an optical flow result and a disparity result of the binocular image; a scene flow energy function of the binocular image is determined based on the processing parameters; the scene flow energy function is optimized to obtain a target scene flow energy function; and the scene flow of the pixel points in the binocular image is determined based on the target scene flow energy function. The target scene flow energy function is thereby determined using only a small number of parameters, which solves the problem in the related art that energy functions are solved with insufficient speed and accuracy, and achieves the effect of improving the speed and accuracy of solving complex energy functions.
Drawings
Fig. 1 is a block diagram of a hardware configuration of a mobile terminal of a method for determining a scene flow according to an embodiment of the present invention;
fig. 2 is a flowchart of a method of determining a scene stream according to an embodiment of the present invention;
FIG. 3 is an overall flow diagram according to an embodiment of the invention;
FIG. 4 is a flowchart of an implementation process of training according to an embodiment of the invention;
FIG. 5 is a flow diagram of predicting an LSTM optimization model according to an embodiment of the present invention;
FIG. 6 is a graph showing comparative results according to an embodiment of the present invention;
fig. 7 is a block diagram of a configuration of a scene flow determination apparatus according to an embodiment of the present invention.
Detailed Description
Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings in conjunction with the embodiments.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking an example of the application to a mobile terminal, fig. 1 is a block diagram of a hardware structure of the mobile terminal of a method for determining a scene flow according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data, wherein the mobile terminal may further include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of an application software, such as a computer program corresponding to the method for determining a scene flow in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
In the present embodiment, a method for determining a scene flow is provided, and fig. 2 is a flowchart of a method for determining a scene flow according to an embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:
Step S202, determining processing parameters of pixel points in the acquired binocular image, wherein the processing parameters comprise an instance segmentation result, an optical flow result and a disparity result of the binocular image;
Step S204, determining a scene flow energy function of the binocular image based on the processing parameters;
Step S206, optimizing the scene flow energy function to obtain a target scene flow energy function;
Step S208, determining the scene flow of the pixel points in the binocular image based on the target scene flow energy function.
The execution subject of the above steps may be a terminal, but is not limited thereto.
Through the above steps, processing parameters of pixel points in the acquired binocular image are determined, wherein the processing parameters include an instance segmentation result, an optical flow result and a disparity result of the binocular image; a scene flow energy function of the binocular image is determined based on the processing parameters; the scene flow energy function is optimized to obtain a target scene flow energy function; and the scene flow of the pixel points in the binocular image is determined based on the target scene flow energy function. The target scene flow energy function is thereby determined using only a small number of parameters, which solves the problem in the related art that energy functions are solved with insufficient speed and accuracy, and achieves the effect of improving the speed and accuracy of solving complex energy functions.
In one exemplary embodiment, determining the processing parameters of the pixel points in the acquired binocular image includes:
S1, acquiring the binocular image using a binocular camera device, wherein the binocular image includes an Nth-frame left image and an Nth-frame right image, and N is a natural number greater than or equal to 1;
S2, performing instance segmentation on the binocular image through a preset instance segmentation network to obtain the instance segmentation result;
S3, determining, through a preset optical flow estimation network, the displacement of pixel points in the binocular image in a preset coordinate system to obtain the optical flow result;
S4, determining, through a preset stereo matching network, the coordinate offset of pixel points in the binocular image in the preset coordinate system to obtain the disparity result;
S5, determining the instance segmentation result, the optical flow result and the disparity result as the processing parameters.
Optionally, in this embodiment, left-eye and right-eye images of consecutive frames may be acquired by a binocular camera. For example, the left and right images are stereo-calibrated and rectified to obtain the intrinsic and extrinsic parameters of the camera and the rectified images. From the first-frame left image and the second-frame left image, the displacement of each pixel of the first-frame left image in the x and y directions, i.e. the optical flow result, can be obtained through an optical flow network; from the first-frame left and right images, the offset of each pixel of the first-frame left image in the x direction, i.e. the disparity result, can be obtained through a binocular stereo matching network; and from the first-frame left image, the instance segmentation result can be obtained through an instance segmentation network.
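As an illustration only, the following sketch shows how the three processing parameters could be produced from a rectified binocular pair with pre-trained networks; the class names, call signatures and tensor shapes are hypothetical placeholders for the instance segmentation, optical flow and stereo matching networks referred to in this embodiment, not the patent's actual implementation.

```python
# Minimal sketch (assumption): seg_net, flow_net and stereo_net stand in for
# pre-trained instance segmentation, optical flow and stereo matching networks.
import torch

def compute_processing_parameters(L0, L1, R0, seg_net, flow_net, stereo_net):
    """L0, L1: left images of frame N and frame N+1; R0: right image of frame N.
    All inputs are (1, 3, H, W) tensors from a rectified binocular camera."""
    with torch.no_grad():
        instances = seg_net(L0)         # instance segmentation result of the left image
        flow = flow_net(L0, L1)         # (1, 2, H, W): per-pixel displacement in x and y
        disparity = stereo_net(L0, R0)  # (1, 1, H, W): per-pixel offset in x (disparity)
    return instances, flow, disparity   # the three processing parameters
```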
In one exemplary embodiment, determining the scene flow energy function of the binocular image based on the processing parameters includes:
S1, determining a photometric error constraint term, a rigid fitting constraint term and an optical flow consistency constraint term of the binocular image using the instance segmentation result, the optical flow result, the disparity result and a rotation-translation matrix, wherein the rotation-translation matrix is determined based on the rotation and translation of a target object in the binocular image between adjacent frame images;
S2, determining the scene flow energy function of the binocular image using the photometric error constraint term, the rigid fitting constraint term and the optical flow consistency constraint term.
In one exemplary embodiment, determining the scene flow energy function of the binocular image using the photometric error constraint term, the rigid fitting constraint term and the optical flow consistency constraint term includes determining
E_i = E_photo,i + E_rigid,i + E_flow,i
wherein E_i represents the scene flow energy function, E_photo,i represents the photometric error constraint term, E_rigid,i represents the rigid fitting constraint term, E_flow,i represents the optical flow consistency constraint term, and i is an integer greater than or equal to 0.
Optionally, in this embodiment, let L_0, R_0, L_1, R_1 denote the left and right images of two consecutive binocular frames, D_0 and D_1 the disparity maps of the 0th and 1st frames, F_L and F_R the optical flow maps of the left and right views between the two consecutive frames, S_L0 the instance segmentation result of the 0th-frame left image, and RT the rotation-translation matrix of the target between the two frames. The construction of the scene flow in this embodiment consists of three parts: the scene flow energy function is built from the gray values, the three-dimensional point clouds and the optical flow of each instance in the scene, based on the photometric error constraint, the rigid fitting constraint and the optical flow consistency constraint, respectively.
In one exemplary embodiment, determining the photometric error constraint term, the rigid fitting constraint term and the optical flow consistency constraint term of the binocular image using the instance segmentation result, the optical flow result, the disparity result and the rotation-translation matrix includes evaluating formulas 2 to 4 given in the detailed description, wherein E_photo,i represents the photometric error constraint term, E_rigid,i represents the rigid fitting constraint term, E_flow,i represents the optical flow consistency constraint term, i is an integer greater than or equal to 0, α_p is an indicator function indicating whether the current pixel in the binocular image is an outlier, p represents a pixel point in the current instance P_i of the binocular image, P_i represents the i-th instance in the instance set S_L0 of the left image L_0 of the binocular image, p' represents the pixel point obtained by re-projecting the three-dimensional point cloud of p in the Nth frame into the (N+1)th frame through the internal and external parameters of the camera device, and RT represents the rotation-translation matrix.
In one exemplary embodiment, p' is determined by p' = π_K(RT · π_K^{-1}(p, D_0(p))), wherein π_K represents projecting the three-dimensional point cloud onto the imaging plane of the camera device using the camera intrinsic parameters to obtain a two-dimensional image; π_K^{-1} represents reconstructing the three-dimensional point cloud from the two-dimensional coordinates and the disparity map through the stereo calibration parameters of the camera device; and q represents the pixel point obtained by adding to the pixel point p in the binocular image the optical flow value corresponding to p.
In an exemplary embodiment, optimizing the scene flow energy function to obtain the target scene flow energy function includes:
S1, optimizing the scene flow energy function using a long short-term memory (LSTM) network to obtain the target scene flow energy function, wherein the LSTM network includes hidden LSTM layers and a linear layer.
Optionally, in this embodiment, the LSTM network is trained and used for prediction by simulating unrolled gradient descent, and updates of RT and updates of the LSTM network are performed alternately during training.
In an exemplary embodiment, optimizing the scene flow energy function using the long short-term memory (LSTM) network to obtain the target scene flow energy function includes:
S1, determining a training number and an expansion number, wherein the training number indicates the number of times the LSTM network is updated, and the expansion number indicates the number of times the rotation-translation matrix is updated;
S2, initializing the LSTM network to obtain an initial LSTM network;
S3, when the number of LSTM network updates is smaller than the training number and the number of rotation-translation matrix updates is smaller than the expansion number, inputting the read sampled processing parameters and the initial rotation-translation matrix into the initial LSTM network to train the initial LSTM network and obtain the target scene flow energy function.
Optionally, in this embodiment, the training number and the expansion number are set and the LSTM is randomly initialized; training is then carried out according to the training number and the expansion number.
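The core primitive of this optimization, a single unrolled update of RT, can be sketched as follows; energy_fn and meta_opt are illustrative placeholders for the per-instance energy function and the LSTM network, and the 6-dimensional parameterization of RT is an assumption of the sketch.

```python
# Sketch (assumption): one unrolling step. `rt` is a (B, 6) tensor of motion
# parameters, `energy_fn` evaluates the scene-flow energy for it, and `meta_opt`
# is the LSTM that maps the gradient to an update; all names are illustrative.
import torch

def unroll_step(rt, state, energy_fn, meta_opt):
    loss = energy_fn(rt)                                      # current value of E(RT)
    grad, = torch.autograd.grad(loss, rt, create_graph=True)  # dE/dRT
    delta_rt, state = meta_opt(grad, state)                   # LSTM predicts the update
    return rt + delta_rt, state, loss                         # RT <- RT + dRT
```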
In an exemplary embodiment, after optimizing the scene flow energy function and obtaining the target scene flow energy function, the method further includes:
S1, testing the target scene flow energy function using the processing parameters of the pixel points in the binocular image and the initial rotation-translation matrix RT.
Optionally, in this embodiment, a pair of consecutive-frame left and right images is obtained, and the initial RT and the expression of the energy function can be obtained through instance segmentation, disparity estimation and optical flow estimation. Optimization is then carried out instance by instance: for each instance, given the LSTM expansion (unrolling) number n, the current loss value is obtained by feeding RT into the energy function, the gradient of the loss with respect to RT is computed and input into the LSTM to obtain the update of RT, and RT is updated; after n updates, the RT value corresponding to the minimum loss is output, and this value is the motion state of the instance.
The invention is illustrated below with reference to specific examples:
in this embodiment, the scene stream is used to represent a three-dimensional motion vector of each pixel point in the image in the real scene, and is represented by a set of RTs. The scene flow can be obtained by calculating binocular images of two continuous frames, generally, an optical flow is calculated by two frames of images, and the optical flow represents the motion displacement of pixel points in the x and y directions; the parallax can be calculated according to the binocular left and right images, and the depth of the point can be estimated according to the parallax and the camera internal and external parameters. And constructing a scene flow energy function according to the optical flow result and the parallax result, and optimizing to obtain a scene flow result.
As shown in fig. 3, the overall process in this embodiment includes the following steps:
S301: left-eye and right-eye images of consecutive frames are obtained through the binocular camera, and the left and right images are stereo-calibrated and rectified to obtain the intrinsic and extrinsic parameters of the camera and the rectified images.
S302: from the first-frame left image and the second-frame left image, the displacement of each pixel of the first-frame left image in the x and y directions can be obtained through an optical flow network; from the first-frame left and right images, the offset of each pixel of the first-frame left image in the x direction, i.e. the disparity, can be obtained through a binocular stereo matching network; from the first-frame left image, the instance segmentation result can be obtained through an instance segmentation network.
S303: a scene flow energy function is constructed from the instance segmentation result, the optical flow result and the disparity result, and the initial RT is calculated.
S304: the energy function is progressively optimized using the LSTM, and RT is continuously updated to finally obtain the RT of each instance.
S305: the scene flow result is obtained.
Optionally, the scene flow energy function is constructed as follows:
in the embodiment, disparity estimation is performed based on the stereo matching network Dispnet of deep learning, optical flow calculation is performed based on the optical flow estimation network FlowNet of deep learning, and example segmentation result calculation is performed based on the Centermask network. The scene flow estimation based on the examples considers each example as a whole and estimates the same motion vector, so that the example segmentation module only segments various vehicles and backgrounds.
Let L_0, R_0, L_1, R_1 denote the left and right images of two consecutive binocular frames, D_0 and D_1 the disparity maps of the 0th and 1st frames, F_L and F_R the optical flow maps of the left and right views between the two consecutive frames, S_L0 the instance segmentation result of the 0th-frame left image, and RT the rotation-translation matrix of the target between the two frames. The construction of the scene flow in this scheme consists of three parts: the energy function shown in formula 1 is constructed from the gray values, the three-dimensional point clouds and the optical flow of each instance in the scene, based on the photometric error constraint, the rigid fitting constraint and the optical flow consistency constraint, respectively.
E_i = E_photo,i + E_rigid,i + E_flow,i    (formula 1)
The photometric error constraint term E_photo,i, the rigid fitting constraint term E_rigid,i and the optical flow consistency constraint term E_flow,i are given by formulas 2 to 4, respectively (the formulas appear as equation images in the original publication).
Here α_p is an indicator function indicating whether the current pixel is an outlier; a value of 0 indicates an outlier. p denotes a pixel point of the current instance P_i, and P_i denotes the i-th instance in the instance set S_L0 of L_0. p' denotes the pixel point obtained after the three-dimensional point of p in the 0th frame is re-projected into the 1st frame through the internal and external parameters of the camera; this transformation can be expressed as p' = π_K(RT · π_K^{-1}(p, D_0(p))), wherein π_K denotes projecting a three-dimensional point cloud onto the camera imaging plane using the camera intrinsic parameters to obtain a two-dimensional image, and π_K^{-1} denotes reconstructing the three-dimensional point cloud from the two-dimensional coordinates and the disparity map through the camera stereo calibration parameters. q is the pixel point obtained by adding the corresponding optical flow value to the pixel point p, i.e. q = p + F_L(p).
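The three constraint terms can be sketched for a single instance as follows. Because the exact expressions of formulas 2 to 4 are given only as images in the original publication, the particular norms and the nearest-neighbour lookup used here are assumptions that merely follow the verbal description; project and backproject are the helpers sketched earlier, and the images are assumed to be grayscale arrays.

```python
# Sketch (assumption): E_i = E_photo,i + E_rigid,i + E_flow,i for one instance P_i,
# following the verbal description of formulas 1-4; weights and norms are illustrative.
import numpy as np

def _lookup(img, xy):
    # nearest-neighbour lookup for brevity; bilinear sampling would be used in practice
    return img[int(round(float(xy[1]))), int(round(float(xy[0])))]

def instance_energy(pixels, alpha, L0, L1, D0, D1, FL, R, T, K, fx, baseline):
    e_photo = e_rigid = e_flow = 0.0
    for p, a in zip(pixels, alpha):                           # p = (u, v), a = alpha_p
        if a == 0:                                            # alpha_p == 0 marks an outlier
            continue
        u, v = p
        q = np.array([u, v], dtype=float) + FL[v, u]          # q = p + F_L(p)
        P0 = backproject((u, v), D0[v, u], K, fx, baseline)   # frame-N 3-D point of p
        P_moved = R @ P0 + T                                  # RT applied to that point
        p_prime = project(P_moved, K)                         # p': re-projection into frame N+1
        # photometric error: gray value of p in L0 vs gray value of p' in L1
        e_photo += abs(float(L0[v, u]) - float(_lookup(L1, p_prime)))
        # rigid fitting: moved frame-N point vs frame-(N+1) point cloud at q
        Q1 = backproject(q, _lookup(D1, q), K, fx, baseline)
        e_rigid += np.linalg.norm(P_moved - Q1)
        # optical flow consistency: re-projected position vs optical-flow target
        e_flow += np.linalg.norm(p_prime - q)
    return e_photo + e_rigid + e_flow                         # E_i, formula 1
```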
Optionally, the LSTM-based optimization of the scene flow energy function includes the following:
the LSTM network in this embodiment is an LSTM layer comprising two hidden layers and one linear layer. The input of the network is the gradient of the energy function to the RT, and the output is the update quantity of the RT. The input and output sizes are both (B, 6).
The LSTM network in this embodiment is a meta-optimizer, which has the same function as the traditional optimizer gauss-newton method, Adam, RMS, and predicts the update amount according to the gradient. The difference is that the LSTM network is data dependent, trainable.
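A possible realization of this meta-optimizer is sketched below; the hidden size and class name are assumptions, while the two LSTM layers, the linear layer and the (B, 6) input/output sizes follow the description above.

```python
# Sketch (assumption): LSTM meta-optimizer mapping the (B, 6) gradient of the
# energy with respect to RT to a (B, 6) update of RT; hidden size is illustrative.
import torch
import torch.nn as nn

class LSTMMetaOptimizer(nn.Module):
    def __init__(self, hidden_size=64):
        super().__init__()
        self.lstm = nn.LSTM(input_size=6, hidden_size=hidden_size, num_layers=2)
        self.linear = nn.Linear(hidden_size, 6)

    def forward(self, grad, state=None):
        # grad: (B, 6) gradient of the energy function with respect to RT
        out, state = self.lstm(grad.unsqueeze(0), state)   # add a length-1 time dimension
        delta_rt = self.linear(out.squeeze(0))             # (B, 6) predicted update of RT
        return delta_rt, state
```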
Optionally, on the training set, for each pair of consecutive binocular frames, the semantic segmentation result, the disparity estimation result and the optical flow result are computed with the pre-trained semantic segmentation network, disparity estimation network and optical flow estimation network, respectively. From these results, the initial RT and the corresponding energy function of each instance can be calculated.
In this embodiment, the LSTM network is trained and used for prediction by simulating unrolled gradient descent, and updates of RT and updates of the LSTM network are performed alternately during training. The implementation process, shown in fig. 4, includes the following steps:
S401: the training number and the expansion (unrolling) number are set, and the LSTM is randomly initialized. The training number is the number of times the LSTM network is updated from the training set, and the expansion number is the number of times RT is updated within each training iteration.
S402: it is judged whether the current training iteration is smaller than the training number; if so, go to S403; otherwise the model is output (S408) and training is finished. A batch of training data is read, including two consecutive binocular frames, the semantic segmentation result, the optical flow and disparity results, and the initial RT.
S403: it is judged whether the current expansion count is smaller than the expansion number; if not, go to S410; if so, go to S404.
S404-S405, S409: the energy function loss is computed from RT, the gradient of the loss with respect to RT is computed and input into the LSTM to predict the corresponding RT update ΔRT, and RT is updated as RT = RT + ΔRT.
S406: the losses of the preceding RT updates are summed to obtain sum_loss, and the LSTM network is updated (S410).
S407: RT is updated, and S402-S410 are repeated.
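The alternating procedure S401-S408 can be sketched as follows, reusing the LSTMMetaOptimizer sketched above; the data loader interface (yielding a per-batch energy function and initial RT) and the Adam outer optimizer are assumptions introduced for the sketch, not details stated in the patent.

```python
# Sketch (assumption) of training by simulated gradient-descent unrolling:
# RT is updated num_unroll times inside each iteration, and the LSTM is then
# updated from the sum of the per-step losses.
import torch

def train_meta_optimizer(meta_opt, data_loader, num_train, num_unroll, lr=1e-3):
    outer_opt = torch.optim.Adam(meta_opt.parameters(), lr=lr)
    for _, (energy_fn, rt_init) in zip(range(num_train), data_loader):
        rt = rt_init.detach().clone().requires_grad_(True)   # initial RT of this batch
        state, sum_loss = None, 0.0
        for _ in range(num_unroll):                          # unroll RT updates n times
            loss = energy_fn(rt)
            grad, = torch.autograd.grad(loss, rt, create_graph=True)
            delta_rt, state = meta_opt(grad, state)          # LSTM predicts dRT
            rt = rt + delta_rt                               # RT <- RT + dRT
            sum_loss = sum_loss + loss                       # accumulate sum_loss
        outer_opt.zero_grad()
        sum_loss.backward()                                  # update the LSTM network
        outer_opt.step()
    return meta_opt
```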
Optionally, this embodiment performs prediction with the LSTM optimization model in the following manner, as shown in fig. 5, including the following steps:
S501: given a pair of consecutive-frame left and right images, the initial RT and the expression of the energy function can be obtained through instance segmentation, disparity estimation and optical flow estimation.
S502: optimization is carried out instance by instance; for each instance, the LSTM expansion number n is given; if the current expansion count is smaller than n, go to S503, otherwise go to S508.
S503: the current loss value is obtained by feeding RT into the energy function; go to S507 (energy function collection).
S504: the gradient of the loss with respect to RT is computed.
S505: the gradient is input into the LSTM to obtain the update amount of RT.
S506: RT is updated; go to S509 (RT collection).
S508: after n updates, the RT value corresponding to the minimum loss is output; this value is the motion state of the instance.
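The prediction stage S501-S508 for a single instance can be sketched as follows; it reuses the trained meta-optimizer and the per-instance energy function from the sketches above, and it keeps the RT with the lowest energy, as described in S508.

```python
# Sketch (assumption) of per-instance prediction with the trained LSTM optimizer.
import torch

def optimize_instance(meta_opt, energy_fn, rt_init, num_unroll):
    rt = rt_init.detach().clone().requires_grad_(True)
    state = None
    best_rt, best_loss = rt.detach().clone(), float("inf")
    for _ in range(num_unroll):
        loss = energy_fn(rt)
        if loss.item() < best_loss:                          # keep the RT with minimal loss
            best_loss, best_rt = loss.item(), rt.detach().clone()
        grad, = torch.autograd.grad(loss, rt)                # gradient of loss w.r.t. RT
        delta_rt, state = meta_opt(grad, state)
        rt = (rt + delta_rt).detach().requires_grad_(True)   # no meta-gradient at test time
    if energy_fn(rt).item() < best_loss:                     # also check the final RT
        best_rt = rt.detach().clone()
    return best_rt                                           # motion state of the instance
```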
Optionally, the effect of the LSTM optimization model in this embodiment is as follows: taking the KITTI dataset as an example, the LSTM optimizer is compared with the RMS and Adam optimizers, and the comparison results are shown in fig. 6. As can be seen from fig. 6, the LSTM greatly outperforms the conventional RMS and Adam optimizers in both convergence speed and accuracy.
In summary, the LSTM optimization module in this embodiment is trainable; after being trained on a specific dataset it can fit that dataset well, so that higher scene flow accuracy can be obtained. Compared with traditional optimization methods such as the Gauss-Newton method, Adam and RMS, which require a large amount of computation to obtain the parameter updates, the structure adopted in this scheme, with only two hidden LSTM layers and one linear layer, involves little computation and is computationally efficient, which is a great advantage for a scene flow algorithm. The LSTM optimization network in this embodiment can be trained without scene flow ground truth, which overcomes the problem that scene flow results cannot be annotated and provides a new idea for improving scene flow accuracy with deep learning methods. The LSTM optimization network in this embodiment is a problem-independent meta-learner, is applicable to solving arbitrary objective functions, and has very strong scalability.
Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
In this embodiment, a scene flow determination apparatus is further provided. The apparatus is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
Fig. 7 is a block diagram of a configuration of a scene flow determination apparatus according to an embodiment of the present invention, as shown in fig. 7, the apparatus including:
a first determining module 72, configured to determine processing parameters of pixel points in the acquired binocular image, wherein the processing parameters include an instance segmentation result, an optical flow result and a disparity result of the binocular image;
a second determining module 74 for determining a scene flow energy function of the binocular image based on the processing parameters;
a third determining module 76, configured to optimize the scene flow energy function to obtain a target scene flow energy function;
a fourth determining module 78, configured to determine a scene flow of pixel points in the binocular image based on the target scene flow energy function.
In an exemplary embodiment, the first determining module includes:
a first acquisition unit, configured to acquire the binocular image using a binocular camera device, wherein the binocular image includes an Nth-frame left image and an Nth-frame right image, and N is a natural number greater than or equal to 1;
a first determining unit, configured to perform instance segmentation on the binocular image through a preset instance segmentation network to obtain the instance segmentation result;
a second determining unit, configured to determine, through a preset optical flow estimation network, the displacement of pixel points in the binocular image in a preset coordinate system to obtain the optical flow result;
a third determining unit, configured to determine, through a preset stereo matching network, the coordinate offset of pixel points in the binocular image in the preset coordinate system to obtain the disparity result;
a fourth determining unit, configured to determine the instance segmentation result, the optical flow result and the disparity result as the processing parameters.
In an exemplary embodiment, the second determining module includes:
a fifth determining unit, configured to determine a photometric error constraint term, a rigid fitting constraint term and an optical flow consistency constraint term of the binocular image using the instance segmentation result, the optical flow result, the disparity result and a rotation-translation matrix, wherein the rotation-translation matrix is determined based on the rotation and translation of a target object in the binocular image between adjacent frame images;
a sixth determining unit, configured to determine the scene flow energy function of the binocular image using the photometric error constraint term, the rigid fitting constraint term and the optical flow consistency constraint term.
In an exemplary embodiment, the sixth determining unit includes:
a first determining subunit, configured to determine E_i = E_photo,i + E_rigid,i + E_flow,i, wherein E_i represents the scene flow energy function, E_photo,i represents the photometric error constraint term, E_rigid,i represents the rigid fitting constraint term, E_flow,i represents the optical flow consistency constraint term, and i is an integer greater than or equal to 0.
In an exemplary embodiment, the fifth determining unit includes: a second determining subunit, configured to determine the photometric error constraint term E_photo,i, the rigid fitting constraint term E_rigid,i and the optical flow consistency constraint term E_flow,i according to formulas 2 to 4, wherein i is an integer greater than or equal to 0, α_p is an indicator function indicating whether the current pixel in the binocular image is an outlier, p represents a pixel point in the current instance P_i of the binocular image, P_i represents the i-th instance in the instance set S_L0 of the left image L_0 of the binocular image, p' represents the pixel point obtained by re-projecting the three-dimensional point cloud of p in the Nth frame into the (N+1)th frame through the internal and external parameters of the camera device, and RT represents the rotation-translation matrix.
In one exemplary embodiment, p' is determined by p' = π_K(RT · π_K^{-1}(p, D_0(p))), wherein π_K represents projecting the three-dimensional point cloud onto the imaging plane of the camera device using the camera intrinsic parameters to obtain a two-dimensional image; π_K^{-1} represents reconstructing the three-dimensional point cloud from the two-dimensional coordinates and the disparity map through the stereo calibration parameters of the camera device; and q represents the pixel point obtained by adding to the pixel point p in the binocular image the optical flow value corresponding to p.
In an exemplary embodiment, the third determining module includes:
a seventh determining unit, configured to optimize the scene flow energy function using a long short-term memory (LSTM) network to obtain the target scene flow energy function, wherein the LSTM network includes hidden LSTM layers and a linear layer.
In an exemplary embodiment, the seventh determining unit includes:
a third determining subunit, configured to determine a training number and an expansion number, wherein the training number indicates the number of times the LSTM network is updated, and the expansion number indicates the number of times the rotation-translation matrix is updated;
a fourth determining subunit, configured to initialize the LSTM network to obtain an initial LSTM network;
a fifth determining subunit, configured to, when the number of LSTM network updates is smaller than the training number and the number of rotation-translation matrix updates is smaller than the expansion number, input the read sampled processing parameters and the initial rotation-translation matrix into the initial LSTM network to train the initial LSTM network and obtain the target scene flow energy function.
In an exemplary embodiment, the apparatus further includes: a testing module, configured to test the target scene flow energy function using the processing parameters of the pixel points in the binocular image and the initial rotation-translation matrix RT after the scene flow energy function has been optimized to obtain the target scene flow energy function.
It should be noted that, the above modules may be implemented by software or hardware, and for the latter, the following may be implemented, but not limited to: the modules are all positioned in the same processor; alternatively, the modules are respectively located in different processors in any combination.
Embodiments of the present invention also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.
In the present embodiment, the above-mentioned computer-readable storage medium may be configured to store a computer program for executing the following steps:
S1, determining processing parameters of pixel points in the acquired binocular image, wherein the processing parameters comprise an instance segmentation result, an optical flow result and a disparity result of the binocular image;
s2, determining a scene flow energy function of the binocular image based on the processing parameters;
s3, optimizing the scene flow energy function to obtain a target scene flow energy function;
and S4, determining the scene flow of the pixel points in the binocular image based on the target scene flow energy function.
In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
In an exemplary embodiment, the processor may be configured to execute, through a computer program, the following steps:
S1, determining processing parameters of pixel points in the acquired binocular image, wherein the processing parameters comprise an instance segmentation result, an optical flow result and a disparity result of the binocular image;
s2, determining a scene flow energy function of the binocular image based on the processing parameters;
s3, optimizing the scene flow energy function to obtain a target scene flow energy function;
and S4, determining the scene flow of the pixel points in the binocular image based on the target scene flow energy function.
For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.
It will be apparent to those skilled in the art that the various modules or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device or distributed across a network of computing devices, and they may be implemented using program code executable by the computing devices, such that they may be stored in a memory device and executed by the computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims (12)

1. A method for determining a scene flow, comprising:
determining processing parameters of pixel points in an acquired binocular image, wherein the processing parameters comprise an instance segmentation result, an optical flow result and a disparity result of the binocular image;
determining a scene flow energy function of the binocular image based on the processing parameters;
optimizing the scene flow energy function to obtain a target scene flow energy function;
and determining the scene flow of the pixel points in the binocular image based on the target scene flow energy function.
2. The method of claim 1, wherein determining processing parameters for pixel points in the acquired binocular image comprises:
acquiring the binocular image by using binocular camera equipment, wherein the binocular image comprises an Nth frame left image and an Nth frame right image, and N is a natural number greater than or equal to 1;
performing instance segmentation on the binocular image through a preset instance segmentation network to obtain the instance segmentation result;
determining the displacement of pixel points in the binocular image in a preset coordinate system through a preset optical flow estimation network to obtain the optical flow result;
determining the coordinate offset of pixel points in the binocular image in the preset coordinate system through a preset stereo matching network to obtain the disparity result;
determining the instance segmentation result, the optical flow result and the disparity result as the processing parameters.
3. The method of claim 1, wherein determining the scene flow energy function for the binocular image based on the processing parameters comprises:
determining a photometric error constraint term, a rigid fitting constraint term and an optical flow consistency constraint term of the binocular image by using the instance segmentation result, the optical flow result, the disparity result and a rotation-translation matrix, wherein the rotation-translation matrix is determined based on the rotation and translation of a target object in the binocular image between adjacent frame images;
and determining a scene flow energy function of the binocular image by using the photometric error constraint term, the rigid fitting constraint term and the optical flow consistency constraint term.
4. The method of claim 3, wherein determining the scene flow energy function for the binocular image using the photometric error constraint term, a rigid fit constraint term, and an optical flow consistency constraint term comprises:
E_i = E_photo,i + E_rigid,i + E_flow,i;
wherein E_i represents the scene flow energy function, E_photo,i represents the photometric error constraint term, E_rigid,i represents the rigid fitting constraint term, E_flow,i represents the optical flow consistency constraint term, and i is an integer greater than or equal to 0.
5. The method of claim 3, wherein determining the photometric error constraint term, the rigid fitting constraint term and the optical flow consistency constraint term of the binocular image by using the instance segmentation result, the optical flow result, the disparity result and the rotation-translation matrix comprises determining E_photo,i, E_rigid,i and E_flow,i according to formulas 2 to 4 (given as equation images in the original text), wherein E_photo,i represents the photometric error constraint term, E_rigid,i represents the rigid fitting constraint term, E_flow,i represents the optical flow consistency constraint term, i is an integer greater than or equal to 0, α_p is an indicator function indicating whether the current pixel in the binocular image is an outlier, p represents a pixel point in the current instance P_i of the binocular image, P_i represents the i-th instance in the instance set S_L0 of the left image L_0 of the binocular image, p' represents the pixel point obtained by re-projecting the three-dimensional point cloud of p in the Nth frame into the (N+1)th frame through the internal and external parameters of the camera device, and RT represents the rotation-translation matrix.
6. The method of claim 5, wherein p' is determined by:
[the equation determining p', reproduced in the original as image FDA0002748281330000031]
wherein π_K represents the projection of the three-dimensional point cloud onto the image plane using the intrinsic parameters of the camera device; the three-dimensional point cloud, denoted in the original by image FDA0002748281330000032, represents the point cloud reconstructed from the two-dimensional coordinates and the disparity map using the stereo calibration parameters of the camera device; and q represents the pixel point obtained after adding, to the pixel point p in the binocular image, the optical flow value corresponding to p.
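Since the claimed equation for p' is reproduced only as an image, the sketch below shows one common reading of the relationship described in claims 5 and 6, assuming a rectified stereo pair with pinhole intrinsics K and a known baseline: back-project p with its disparity, move the 3D point by the instance's rotation-translation RT, project it back with π_K, and form q by displacing p with its optical flow value. The pinhole back-projection, the 3x4/4x4 form of RT and the variable names are assumptions beyond what the claims state.

import numpy as np

def reproject(p, disparity, flow, K, baseline, RT):
    """p = (u, v) pixel in frame N; disparity: (H, W); flow: (2, H, W); K: 3x3 intrinsics."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    u, v = p
    d = disparity[v, u]

    # Back-projection: 3D point from the 2D coordinates and the disparity map
    # (stereo calibration parameters fx and baseline).
    Z = fx * baseline / d
    X = (u - cx) * Z / fx
    Y = (v - cy) * Z / fy
    point = np.array([X, Y, Z, 1.0])

    # Rigid motion of the instance between frame N and frame N+1.
    moved = RT @ point  # RT as a 3x4 or 4x4 rotation-translation matrix

    # Projection pi_K back onto the image plane of frame N+1 gives p'.
    p_prime = np.array([fx * moved[0] / moved[2] + cx,
                        fy * moved[1] / moved[2] + cy])

    # q: pixel p displaced by its optical flow value (used by the flow-consistency term).
    q = np.array([u, v], dtype=float) + flow[:, v, u]
    return p_prime, q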
7. The method of claim 1, wherein optimizing the scene flow energy function to obtain a target scene flow energy function comprises:
and optimizing the scene flow energy function by using a long-short term memory (LSTM) network to obtain the target scene flow energy function, wherein the LSTM network comprises a hidden layer LSTM and a linear layer.
8. The method of claim 7, wherein optimizing the scene flow energy function using a long-short term memory (LSTM) network to obtain the target scene flow energy function comprises:
determining training times and expansion times, wherein the training times represent the number of times the LSTM network is updated, and the expansion times represent the number of times the rotation-translation matrix is updated;
initializing the LSTM network to obtain an initial LSTM network;
and, while the number of times the LSTM network has been updated is less than the training times and the number of times the rotation-translation matrix has been updated is less than the expansion times, inputting the read sampled processing parameters and the initial rotation-translation matrix into the initial LSTM network to train the initial LSTM network, so as to obtain the target scene flow energy function.
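A minimal sketch of the unrolled optimisation described in claims 7 and 8 follows, assuming a 6-parameter rotation-translation representation, a single hidden-layer LSTM followed by a linear layer, and the scene flow energy itself as the training signal; the class and function names, dimensions and learning rate are illustrative assumptions rather than the claimed design. The inner loop plays the role of the expansion times (rotation-translation updates) and the outer loop of the training times (network updates).

import torch
import torch.nn as nn

class RTOptimizer(nn.Module):
    """LSTM (hidden layer) + linear layer that proposes increments to the RT parameters."""
    def __init__(self, feat_dim, hidden=128, rt_dim=6):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim + rt_dim, hidden, num_layers=1)
        self.linear = nn.Linear(hidden, rt_dim)

    def forward(self, feats, rt, state=None):
        x = torch.cat([feats, rt], dim=-1).unsqueeze(0)  # (seq=1, batch, feat_dim + rt_dim)
        out, state = self.lstm(x, state)
        return rt + self.linear(out.squeeze(0)), state   # updated RT parameters

def train(net, loader, energy_fn, train_steps, unroll_steps, lr=1e-4):
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    step = 0
    for feats, rt0 in loader:            # sampled processing parameters + initial RT
        if step >= train_steps:          # training times: number of network updates
            break
        rt, state, loss = rt0, None, 0.0
        for _ in range(unroll_steps):    # expansion times: number of RT updates
            rt, state = net(feats, rt, state)
            loss = loss + energy_fn(feats, rt)  # scene flow energy as the loss (assumed)
        opt.zero_grad()
        loss.backward()
        opt.step()
        step += 1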
9. The method of claim 1, wherein after optimizing the scene flow energy function to obtain a target scene flow energy function, the method further comprises:
and testing the target scene flow energy function by using the processing parameters of the pixel points in the binocular image and the initial rotation-translation matrix RT.
10. An apparatus for determining a scene stream, comprising:
a first determining module, configured to determine processing parameters of pixel points in an acquired binocular image, wherein the processing parameters comprise an instance segmentation result, an optical flow result and a disparity result of the binocular image;
a second determining module, configured to determine a scene flow energy function of the binocular image based on the processing parameters;
a third determining module, configured to optimize the scene flow energy function to obtain a target scene flow energy function;
and a fourth determining module, configured to determine the scene flow of the pixel points in the binocular image based on the target scene flow energy function.
11. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 9 when executed.
12. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 9.
CN202011174348.4A 2020-10-28 2020-10-28 Scene flow determination method and device, storage medium and electronic device Pending CN112233149A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011174348.4A CN112233149A (en) 2020-10-28 2020-10-28 Scene flow determination method and device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011174348.4A CN112233149A (en) 2020-10-28 2020-10-28 Scene flow determination method and device, storage medium and electronic device

Publications (1)

Publication Number Publication Date
CN112233149A true CN112233149A (en) 2021-01-15

Family

ID=74110699

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011174348.4A Pending CN112233149A (en) 2020-10-28 2020-10-28 Scene flow determination method and device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN112233149A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113591781A (en) * 2021-08-11 2021-11-02 山东大学 Image processing method and system based on service robot cloud platform
CN113591781B (en) * 2021-08-11 2023-07-28 山东大学 Image processing method and system based on service robot cloud platform
CN114494332A (en) * 2022-01-21 2022-05-13 四川大学 Unsupervised estimation method for scene flow from synthesis to real LiDAR point cloud
CN114494332B (en) * 2022-01-21 2023-04-25 四川大学 Unsupervised synthesis to real LiDAR point cloud scene flow estimation method

Similar Documents

Publication Publication Date Title
CN107341814B (en) Four-rotor unmanned aerial vehicle monocular vision range measurement method based on sparse direct method
US11380078B2 (en) 3-D reconstruction using augmented reality frameworks
CN108322724B (en) Image solid matching method and binocular vision equipment
CN108519102B (en) Binocular vision mileage calculation method based on secondary projection
EP3274964B1 (en) Automatic connection of images using visual features
CN113140011A (en) Infrared thermal imaging monocular vision distance measurement method and related assembly
CN108932734B (en) Monocular image depth recovery method and device and computer equipment
CN111899282A (en) Pedestrian trajectory tracking method and device based on binocular camera calibration
CN111862150B (en) Image tracking method, device, AR equipment and computer equipment
CN110443874B (en) Viewpoint data generation method and device based on convolutional neural network
CN113361365B (en) Positioning method, positioning device, positioning equipment and storage medium
CN112233149A (en) Scene flow determination method and device, storage medium and electronic device
CN110428461B (en) Monocular SLAM method and device combined with deep learning
CN112967340A (en) Simultaneous positioning and map construction method and device, electronic equipment and storage medium
CN117237431A (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN113034681B (en) Three-dimensional reconstruction method and device for spatial plane relation constraint
CN112258647B (en) Map reconstruction method and device, computer readable medium and electronic equipment
CN114812601A (en) State estimation method and device of visual inertial odometer and electronic equipment
CN117726747A (en) Three-dimensional reconstruction method, device, storage medium and equipment for complementing weak texture scene
CN115578417A (en) Monocular vision inertial odometer method based on feature point depth
CN113643343B (en) Training method and device of depth estimation model, electronic equipment and storage medium
CN114399526A (en) Pose determination method, pose determination device and storage medium
Mo et al. Cross-based dense depth estimation by fusing stereo vision with measured sparse depth
CN117876608B (en) Three-dimensional image reconstruction method, three-dimensional image reconstruction device, computer equipment and storage medium
CN115170745B (en) Unmanned aerial vehicle distance measurement method based on stereoscopic vision

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination