CN113269699B - Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image - Google Patents

Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image

Info

Publication number
CN113269699B
CN113269699B (application CN202110436248.2A)
Authority
CN
China
Prior art keywords
image
event
gray level
frame
fusion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110436248.2A
Other languages
Chinese (zh)
Other versions
CN113269699A (en)
Inventor
史殿习
刘聪
苏雅倩文
金松昌
杨烟台
景罗希
李雪辉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center, National Defense Technology Innovation Institute PLA Academy of Military Science filed Critical Tianjin (binhai) Intelligence Military-Civil Integration Innovation Center
Priority to CN202110436248.2A priority Critical patent/CN113269699B/en
Publication of CN113269699A publication Critical patent/CN113269699A/en
Application granted granted Critical
Publication of CN113269699B publication Critical patent/CN113269699B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 5/50: Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06N 3/045: Combinations of networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods
    • G06T 7/246: Analysis of motion using feature-based methods, e.g. the tracking of corners or segments
    • G06T 7/269: Analysis of motion using gradient-based methods
    • G06T 2207/10016: Video; Image sequence
    • G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Artificial neural networks [ANN]
    • G06T 2207/20221: Image fusion; Image merging

Abstract

The application relates to the technical field of optical flow estimation in computer vision, and in particular to an optical flow estimation method and system based on the fusion of asynchronous event streams and grayscale images. The method comprises the following steps: acquiring an asynchronous event stream and synchronous grayscale images; preprocessing the asynchronous event stream and the synchronous grayscale images to obtain event frames and grayscale images; performing channel-wise stacking of the event frames and the grayscale images according to time alignment to obtain a multi-channel composite image, pooling the composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into a weight-adaptive extraction network for fusion; and inputting the fused image obtained after fusion into a trained optical flow estimation deep neural network to obtain the final optical flow estimation result. The method and the system organically fuse the event stream and the grayscale image, and can improve the robustness and generalization of the optical flow estimation algorithm.

Description

Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image
Technical Field
The application relates to the technical field of optical flow estimation in computer vision, and in particular to an optical flow estimation method and system based on the fusion of asynchronous event streams and grayscale images.
Background
For the visual task of optical flow estimation, a clear and stable input is a basic premise for ensuring the performance of the algorithm. In ordinary scenes, a conventional camera can capture clear images, but under complex conditions such as high-speed motion and strong illumination changes, the performance of optical flow estimation methods based on conventional cameras degrades. The event camera is a novel bio-inspired visual sensor with a high dynamic range, high temporal resolution and low latency, and is particularly suitable for high-speed dynamic scenes. Existing optical flow estimation methods use either only an event camera or only a conventional camera; combining the two complementary sensors helps improve the robustness of optical flow estimation algorithms.
An event camera outputs an asynchronous event stream. Because of its unique imaging principle, when the illumination change is not pronounced enough or the relative motion between the event camera and the environment is small, the output data becomes unstable and unreliable and contains considerable noise. In such cases, a conventional camera can complementarily capture clear images. Many optical flow estimation algorithms target either asynchronous event streams or conventional cameras, but there is no optical flow estimation method that combines the two.
The output of an event camera is an asynchronous event stream, whose data format is essentially different from that of a conventional grayscale image; the first step in fusing the two is therefore to process the event stream into data with the same format as the grayscale image. Common representations are the event count image and the latest timestamp image of the event stream. In other visual task fields, the most common way of fusing event streams with conventional images is direct merging based on channel expansion. Such fusion does not consider that the imaging quality of event frames and conventional images differs under different conditions, but fuses them in fixed proportion. Moreover, the difference between the two quality distributions also varies across regions of a single image.
Therefore, the present application proposes an optical flow estimation method and system based on asynchronous event stream and grayscale image fusion to at least partially solve the above technical problems.
Disclosure of Invention
The method of the invention converts the asynchronous event stream into the same format as the synchronous grayscale image and then fuses the two, thereby improving the robustness and generalization of the optical flow estimation method.
In order to achieve the above technical purpose, the application provides an optical flow estimation method based on the fusion of asynchronous event streams and grayscale images, which comprises the following steps:
acquiring an asynchronous event stream and synchronous grayscale images;
preprocessing the asynchronous event stream and the synchronous grayscale images to obtain event frames and grayscale images;
performing channel-wise stacking of the event frames and the grayscale images according to time alignment to obtain a multi-channel composite image, pooling the composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into a weight-adaptive extraction network for fusion;
and inputting the fused image obtained after fusion into a trained optical flow estimation deep neural network to obtain the final optical flow estimation result.
Preferably, the asynchronous event stream and the synchronous grayscale images are acquired through an event camera together with scene construction and motion control.
Preferably, the multiple channels are five channels.
Further, preprocessing the asynchronous event stream and the synchronous grayscale images to obtain event frames and grayscale images specifically comprises:
dividing the discrete events into two topics according to the left and right binocular cameras, corresponding to the events and grayscale images respectively;
packaging all events into Rosbag-format file packages;
storing the grayscale images in jpg format and labelling them in time order to obtain a grayscale image sequence;
and receiving the Rosbag-format file packages and the grayscale image sequence, and accumulating the asynchronous event stream in the Rosbag-format file packages into synchronous frames of the same size as the grayscale image, denoted as event frames.
Further, the event frames are represented as event count images or latest timestamp images.
Further, performing channel-wise stacking of the event frames and the grayscale image according to time alignment to obtain a five-channel composite image, pooling the five-channel composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into the weight-adaptive extraction network for fusion specifically comprises:
receiving five frames of event frames and a grayscale image of the same size, the five frames comprising one count image of positive events and its latest timestamp image, one count image of negative events and its latest timestamp image, and one grayscale image;
performing channel-wise stacking of the five frames according to time alignment to obtain a composite image containing five channels;
dividing each channel of the composite image proportionally into 16 sub-regions in a 4 × 4 grid, then performing region average pooling on each sub-region and replacing all pixels in each pooled region with a single value, to obtain a region feature matrix of size 4 × 4;
and stacking the region feature matrices to obtain a tensor of size 4 × 4 × 5, and inputting the tensor into the weight-adaptive extraction network for fusion.
Preferably, the weight-adaptive extraction network is implemented with two fully connected layers: the activation function of the first fully connected layer is a ReLU and that of the second is a Sigmoid, and the output of the weight-adaptive extraction network is a 4 × 4 × 5 tensor of the same size as the input.
Preferably, the training of the optical flow estimation deep neural network comprises:
acquiring a fused image of the event frames and the grayscale image;
performing convolutional encoding and upsampling decoding on the fused image;
calculating the self-supervised loss, where the loss function comprises a smoothness error and a photometric error;
and updating the network connection parameters.
Preferably, the optical flow estimation deep neural network employs four encoding layers, two residual layers, and four decoding layers.
In a second aspect, the present invention further provides an optical flow estimation system based on the fusion of asynchronous event streams and grayscale images, comprising:
a system bottom-layer module, used for acquiring an asynchronous event stream and synchronous grayscale images;
an event frame generation module, used for preprocessing the asynchronous event stream and the synchronous grayscale images to obtain event frames and grayscale images;
an event frame and grayscale image fusion module, used for performing channel-wise stacking of the event frames and the grayscale images according to time alignment to obtain a multi-channel composite image, pooling the composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into a weight-adaptive extraction network for fusion;
and an optical flow estimation deep neural network module, used for inputting the fused image obtained after fusion into the trained optical flow estimation deep neural network to obtain the final optical flow estimation result.
The beneficial effects of this application are as follows:
The optical flow estimation method and system based on the fusion of asynchronous event streams and grayscale images achieve an effective fusion of the originally three-dimensional asynchronous event stream and the grayscale image, yielding a clear two-dimensional image even though the two data formats are completely different; by organically fusing the two, the robustness and generalization of the optical flow estimation algorithm are improved.
Drawings
FIG. 1 shows a schematic flow chart of the method of embodiment 1 of the present application;
FIG. 2 is a schematic diagram showing the fusion process in embodiment 1 of the present application;
FIG. 3 is a schematic diagram showing the overall process of optical flow estimation in embodiment 1 of the present application;
FIG. 4 is a view showing a system configuration employed in embodiment 2 of the present application;
FIG. 5 shows a schematic diagram of the system structure and processing flow in embodiment 3 of the present application.
Detailed Description
Hereinafter, embodiments of the present application will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present application. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present application. It will be apparent to one skilled in the art that the present application may be practiced without one or more of these details. In other instances, well-known features have not been described in order to avoid obscuring the present application.
It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments in accordance with the application. As used herein, the singular is intended to include the plural unless the context clearly dictates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Exemplary embodiments according to the present application will now be described in more detail with reference to the accompanying drawings. These exemplary embodiments may, however, be embodied in many different forms and should not be construed as limited to only the embodiments set forth herein. The figures are not drawn to scale, wherein certain details may be exaggerated and omitted for clarity. The shapes of the various regions, layers and their relative sizes, positional relationships are shown in the drawings as examples only, and in practice deviations due to manufacturing tolerances or technical limitations are possible, and a person skilled in the art may additionally design regions/layers with different shapes, sizes, relative positions according to the actual needs.
Embodiment 1:
This embodiment implements an optical flow estimation method based on the fusion of asynchronous event streams and grayscale images, as shown in FIG. 1, comprising the following steps:
S1, acquiring an asynchronous event stream and synchronous grayscale images;
S2, preprocessing the asynchronous event stream and the synchronous grayscale images to obtain event frames and grayscale images;
S3, performing channel-wise stacking of the event frames and the grayscale images according to time alignment to obtain a multi-channel composite image, pooling the composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into a weight-adaptive extraction network for fusion;
and S4, inputting the fused image obtained after fusion into the trained optical flow estimation deep neural network to obtain the final optical flow estimation result.
Preferably, the asynchronous event stream and the synchronous grayscale images are acquired through an event camera together with scene construction and motion control.
Preferably, the multiple channels are five channels.
In S2, preprocessing the asynchronous event stream and the synchronous grayscale images to obtain event frames and grayscale images specifically comprises:
dividing the discrete events into two topics according to the left and right binocular cameras, corresponding to the events and grayscale images respectively;
packaging all events into Rosbag-format file packages, where the duration of each Rosbag-format file package is set to 10 seconds in order to limit the size of the Rosbag files and reduce the time the system needs to read the data;
storing the grayscale images in jpg format and labelling them in time order to obtain a grayscale image sequence;
and receiving the Rosbag-format file packages and the grayscale image sequence, and accumulating the asynchronous event stream in the Rosbag-format file packages into synchronous frames of the same size as the grayscale image, taking the inter-frame time interval of the grayscale images as the accumulation time span; the results are denoted as event frames.
All of the above operations prior to obtaining the event frames constitute the preprocessing.
Further, the event frames are represented as event count images or latest timestamp images.
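For illustration only, the following sketch shows one way an event slice could be accumulated into the event count images and latest timestamp images named above; the event tuple layout (x, y, t, polarity), the function name events_to_frames and all variable names are assumptions and are not taken from the patent text.

    # Hypothetical sketch: accumulate an asynchronous event slice into the four
    # event-frame channels (positive/negative count images and their latest
    # timestamp images). The (x, y, t, polarity) tuple layout is an assumption.
    import numpy as np

    def events_to_frames(events, height, width, t_start, t_end):
        count_pos = np.zeros((height, width), dtype=np.float32)
        count_neg = np.zeros((height, width), dtype=np.float32)
        ts_pos = np.zeros((height, width), dtype=np.float32)
        ts_neg = np.zeros((height, width), dtype=np.float32)
        for x, y, t, polarity in events:
            if not (t_start <= t < t_end):
                continue  # keep only events inside the accumulation time span
            t_norm = (t - t_start) / (t_end - t_start)  # normalise time to [0, 1)
            if polarity > 0:
                count_pos[y, x] += 1.0                    # event count image
                ts_pos[y, x] = max(ts_pos[y, x], t_norm)  # latest timestamp image
            else:
                count_neg[y, x] += 1.0
                ts_neg[y, x] = max(ts_neg[y, x], t_norm)
        return count_pos, ts_pos, count_neg, ts_neg

Stacking these four frames with the grayscale image then yields the five-channel composite image used in S3.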
In S3, performing channel-wise stacking of the event frames and the grayscale image according to time alignment to obtain a five-channel composite image, pooling the five-channel composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into the weight-adaptive extraction network for fusion specifically comprises:
receiving five frames of event frames and a grayscale image of the same size, the five frames comprising one count image of positive events and its latest timestamp image, one count image of negative events and its latest timestamp image, and one grayscale image;
performing channel-wise stacking of the five frames according to time alignment to obtain a composite image containing five channels;
in order to capture the characteristics of each region while keeping the amount of data small, dividing each channel of the composite image proportionally into 16 sub-regions in a 4 × 4 grid, then performing region average pooling on each sub-region and replacing all pixels in each pooled region with a single value, to obtain a region feature matrix of size 4 × 4;
and stacking the region feature matrices to obtain a tensor of size 4 × 4 × 5, and inputting the tensor into the weight-adaptive extraction network for fusion.
Preferably, the weight-adaptive extraction network is implemented with two fully connected layers: the activation function of the first fully connected layer is a ReLU and that of the second is a Sigmoid, and the output of the weight-adaptive extraction network is a 4 × 4 × 5 tensor of the same size as the input. FIG. 2 is a schematic diagram of the fusion process; in FIG. 2, F stands for function, lsq is the abbreviation of local squeeze, and ex is the abbreviation of Extraction-Net. The asynchronous event tensor x is originally three-dimensional and is expanded into five channels (c stands for channel) according to time alignment, namely the five frames described above: one count image of positive events and its latest timestamp image, one count image of negative events and its latest timestamp image, and one grayscale image. Frame and channel are used synonymously here. Each channel is divided proportionally into 16 sub-regions in a 4 × 4 grid; each sub-region is then region-average-pooled, with all pixels in the pooled region replaced by a single value, yielding a region feature matrix of size 4 × 4. The region feature matrices are stacked into a tensor of size 4 × 4 × 5, which is input into the weight-adaptive extraction network for adaptive weight calculation, so that the asynchronous events are fused with the grayscale image and x is transformed into the fused representation (given as a formula image, BDA0003033216710000091, in the original publication).
The weight-adaptive extraction network thus realizes the fusion of the event frames and the grayscale image: if a given channel is clearer than the others (for example, relative to the event frames), the network assigns it a larger weight, so that the adaptive weight calculation allows the asynchronous event stream and the synchronous grayscale image to be fused effectively.
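For illustration, a minimal PyTorch-style sketch of such a weight-adaptive extraction network is given below. It follows the description above (region average pooling to 4 × 4, two fully connected layers with ReLU and Sigmoid activations, a 4 × 4 × 5 weight tensor); the final step that broadcasts the weights back onto the five-channel composite image, the class name and all variable names are assumptions rather than details taken from the patent.

    # Sketch of the weight-adaptive extraction network (local squeeze + Extraction-Net).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class WeightAdaptiveExtractionNet(nn.Module):
        def __init__(self, channels=5, grid=4):
            super().__init__()
            self.channels, self.grid = channels, grid
            n = channels * grid * grid                 # 5 x 4 x 4 = 80 region features
            self.fc1 = nn.Linear(n, n)                 # first fully connected layer (ReLU)
            self.fc2 = nn.Linear(n, n)                 # second fully connected layer (Sigmoid)

        def forward(self, composite):                  # composite: (B, 5, H, W)
            b, c, h, w = composite.shape
            # Region average pooling: each channel becomes a 4 x 4 feature matrix.
            region = F.adaptive_avg_pool2d(composite, self.grid)     # (B, 5, 4, 4)
            z = region.flatten(1)                                    # (B, 80)
            weights = torch.sigmoid(self.fc2(F.relu(self.fc1(z))))   # (B, 80), in (0, 1)
            weights = weights.view(b, c, self.grid, self.grid)       # (B, 5, 4, 4)
            # Assumed fusion step: broadcast each region weight back to full
            # resolution and reweight the composite image channel by channel.
            weights_full = F.interpolate(weights, size=(h, w), mode="nearest")
            return composite * weights_full

Under this reading, a region of a channel that the network judges more reliable receives a weight closer to 1 and therefore contributes more to the fused image passed on to the optical flow network.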
Preferably, the training of the optical flow estimation deep neural network in S4 comprises:
acquiring a fused image of the event frames and the grayscale image;
performing convolutional encoding and upsampling decoding on the fused image;
calculating the self-supervised loss, where the loss function comprises a smoothness error and a photometric error (a sketch of such a loss is given after this list);
and updating the network connection parameters.
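A hedged sketch of such a self-supervised loss follows; the patent states only that the loss combines a photometric error and a smoothness error, so the Charbonnier penalty, the backward-warping formulation and the weighting factor used here are assumptions.

    # Sketch of a self-supervised loss with photometric and smoothness terms.
    import torch
    import torch.nn.functional as F

    def charbonnier(x, eps=1e-3):
        return torch.sqrt(x * x + eps * eps)

    def warp(img, flow):
        # Backward-warp img (B, 1, H, W) by a pixel-unit flow (B, 2, H, W).
        b, _, h, w = img.shape
        ys, xs = torch.meshgrid(torch.arange(h, device=img.device),
                                torch.arange(w, device=img.device), indexing="ij")
        grid_x = (xs.float() + flow[:, 0]) / (w - 1) * 2 - 1   # normalise to [-1, 1]
        grid_y = (ys.float() + flow[:, 1]) / (h - 1) * 2 - 1
        grid = torch.stack((grid_x, grid_y), dim=-1)           # (B, H, W, 2)
        return F.grid_sample(img, grid, align_corners=True)

    def self_supervised_loss(flow, gray_prev, gray_next, smooth_weight=0.5):
        # Photometric error: the next grayscale frame warped by the estimated
        # flow should match the previous frame.
        photometric = charbonnier(gray_prev - warp(gray_next, flow)).mean()
        # Smoothness error: penalise large spatial gradients of the flow field.
        smoothness = (charbonnier(flow[:, :, :, 1:] - flow[:, :, :, :-1]).mean()
                      + charbonnier(flow[:, :, 1:, :] - flow[:, :, :-1, :]).mean())
        return photometric + smooth_weight * smoothness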
An overall schematic diagram of the optical flow estimation process based on the fusion of the asynchronous event stream and the grayscale image is shown in FIG. 3; in FIG. 3, the weight-adaptive extraction network is abbreviated as LSENet. Further preferably, the optical flow estimation deep neural network employs four encoding layers, two residual layers, and four decoding layers. Each encoding layer is implemented as a convolution with a stride of 2; after each encoding layer, the spatial size of the image is halved and the number of channels is doubled. To enhance the adaptability of the system to images of different sizes, the output of each encoding layer is also stored for a skip connection with the corresponding decoding layer. The residual layers are introduced to prevent overfitting. Each decoding layer is implemented by bilinear-interpolation upsampling; after each decoding layer, the spatial size of the data is doubled and the number of channels is halved. The decoded data is fed into the next decoding layer together with the data from the corresponding encoding layer. The network is trained in a self-supervised manner, and the loss function comprises a smoothness error and a photometric error.
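The compact PyTorch-style sketch below mirrors only the layout stated above (four stride-2 encoding layers, two residual layers, four bilinear-upsampling decoding layers with skip connections from the encoder); the channel counts, kernel sizes, activation choices and the two-channel flow head are illustrative assumptions.

    # Sketch of the optical flow estimation network: 4 encoders, 2 residual blocks, 4 decoders.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class ResidualBlock(nn.Module):
        def __init__(self, c):
            super().__init__()
            self.conv1 = nn.Conv2d(c, c, 3, padding=1)
            self.conv2 = nn.Conv2d(c, c, 3, padding=1)

        def forward(self, x):
            return x + self.conv2(F.relu(self.conv1(x)))   # residual connection

    class FlowEstimationNet(nn.Module):
        def __init__(self, in_channels=5, base=32):
            super().__init__()
            c1, c2, c3, c4 = base, base * 2, base * 4, base * 8
            # Encoding layers: stride-2 convolutions halve H, W and double the channels.
            self.enc1 = nn.Conv2d(in_channels, c1, 3, stride=2, padding=1)
            self.enc2 = nn.Conv2d(c1, c2, 3, stride=2, padding=1)
            self.enc3 = nn.Conv2d(c2, c3, 3, stride=2, padding=1)
            self.enc4 = nn.Conv2d(c3, c4, 3, stride=2, padding=1)
            # Residual layers at the bottleneck to limit overfitting.
            self.res = nn.Sequential(ResidualBlock(c4), ResidualBlock(c4))
            # Decoding layers: bilinear upsampling doubles H, W; encoder features
            # are concatenated through skip connections; channels are halved.
            self.dec4 = nn.Conv2d(c4 + c3, c3, 3, padding=1)
            self.dec3 = nn.Conv2d(c3 + c2, c2, 3, padding=1)
            self.dec2 = nn.Conv2d(c2 + c1, c1, 3, padding=1)
            self.dec1 = nn.Conv2d(c1, c1 // 2, 3, padding=1)       # last level: no skip assumed
            self.flow_head = nn.Conv2d(c1 // 2, 2, 3, padding=1)   # 2-channel optical flow

        @staticmethod
        def _up(x):
            return F.interpolate(x, scale_factor=2, mode="bilinear", align_corners=False)

        def forward(self, fused):                 # fused: (B, 5, H, W), H, W divisible by 16
            e1 = F.relu(self.enc1(fused))         # H/2
            e2 = F.relu(self.enc2(e1))            # H/4
            e3 = F.relu(self.enc3(e2))            # H/8
            e4 = F.relu(self.enc4(e3))            # H/16
            x = self.res(e4)
            x = F.relu(self.dec4(torch.cat([self._up(x), e3], dim=1)))   # H/8
            x = F.relu(self.dec3(torch.cat([self._up(x), e2], dim=1)))   # H/4
            x = F.relu(self.dec2(torch.cat([self._up(x), e1], dim=1)))   # H/2
            x = F.relu(self.dec1(self._up(x)))                            # H
            return self.flow_head(x)

Under these assumptions the network maps the fused five-channel input to a two-channel flow field of the same spatial size, and the loss sketched above can be applied directly to its output.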
Embodiment 2:
This embodiment implements an optical flow estimation system based on the fusion of asynchronous event streams and grayscale images, as shown in FIG. 4, comprising:
a system bottom-layer module 501, used for acquiring an asynchronous event stream and synchronous grayscale images;
an event frame generation module 502, used for preprocessing the asynchronous event stream and the synchronous grayscale images to obtain event frames and grayscale images;
an event frame and grayscale image fusion module 503, used for performing channel-wise stacking of the event frames and the grayscale images according to time alignment to obtain a multi-channel composite image, pooling the composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into a weight-adaptive extraction network for fusion;
and an optical flow estimation deep neural network module 504, used for inputting the fused image obtained after fusion into the trained optical flow estimation deep neural network to obtain the final optical flow estimation result.
Embodiment 3:
This embodiment implements an optical flow estimation system based on the fusion of asynchronous event streams and grayscale images, as shown in FIG. 5, comprising: a system bottom-layer module, an event frame generation module, an event frame and grayscale image fusion module, and an optical flow estimation deep neural network module.
The system bottom-layer module is deployed on an agent equipped with an event camera, can synchronously acquire asynchronous event streams and conventional grayscale images, and generally also comprises the motion control module and the scene construction module carried by the agent. Running ROS on the Ubuntu operating system, the module receives asynchronous event streams from the DAVIS 346 event camera, divides the discrete, independent event points into two topics according to the left and right binocular cameras, and then packs all events in each topic into Rosbag-format file packages, which facilitates subsequent processing and screening of the events. The grayscale images are stored in jpg format and labelled in time order to obtain a grayscale image sequence. The module receives the Rosbag-format file packages and the grayscale image sequence, and accumulates the asynchronous event stream in the Rosbag-format file packages into synchronous frames of the same size as the grayscale image, taking the inter-frame time interval of the grayscale images as the accumulation time span; the results are denoted as event frames.
The event frame generation module is used for preprocessing the asynchronous event stream and the synchronous grayscale images to obtain event frames and grayscale images; all operations before the event frames and grayscale images are obtained constitute the preprocessing.
The event frame and grayscale image fusion module is used for performing channel-wise stacking of the event frames and the grayscale images according to time alignment to obtain a multi-channel composite image, pooling the composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into the weight-adaptive extraction network for fusion. That is, the tensor is input into the extraction network (Extraction-Net) for adaptive weight learning; the network is implemented with two fully connected layers, where the activation function of the first fully connected layer is a ReLU and that of the second is a Sigmoid, and the final output is a 4 × 4 × 5 tensor of the same size as the input.
The optical flow estimation deep neural network module is used for receiving the fused image output by the event frame and grayscale image fusion module and passing it through the trained optical flow estimation deep neural network to obtain the final optical flow estimation result. The optical flow estimation deep neural network employs four encoding layers, two residual layers, and four decoding layers. Each encoding layer is implemented as a convolution with a stride of 2; after each encoding layer, the spatial size of the image is halved and the number of channels is doubled. To enhance the adaptability of the system to images of different sizes, the output of each encoding layer is also stored for a skip connection with the corresponding decoding layer. The residual layers are introduced to prevent overfitting. Each decoding layer is implemented by bilinear-interpolation upsampling; after each decoding layer, the spatial size of the data is doubled and the number of channels is halved. The decoded data is fed into the next decoding layer together with the data from the corresponding encoding layer. The network is trained in a self-supervised manner, and the loss function comprises a smoothness error and a photometric error.
As an alternative embodiment, as shown in FIG. 5, the optical flow estimation deep neural network module in this embodiment outputs a color-coded optical flow estimation map, which can be fed into downstream applications such as emergency obstacle avoidance, visual odometry, video analysis, and weather prediction.
The above description is only for the preferred embodiment of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (8)

1. An optical flow estimation method based on the fusion of asynchronous event streams and grayscale images, characterized by comprising the following steps:
acquiring an asynchronous event stream and synchronous grayscale images;
preprocessing the asynchronous event stream and the synchronous grayscale images to obtain event frames and grayscale images;
performing channel-wise stacking of the event frames and the grayscale images according to time alignment to obtain a multi-channel composite image, pooling the composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into a weight-adaptive extraction network for fusion;
inputting the fused image obtained after fusion into a trained optical flow estimation deep neural network to obtain the final optical flow estimation result;
wherein the multiple channels are five channels;
and performing channel-wise stacking of the event frames and the grayscale images according to time alignment to obtain a five-channel composite image, pooling the five-channel composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into the adaptive extraction network for fusion specifically comprises:
receiving five frames of event frames and a grayscale image of the same size, the five frames comprising one count image of positive events and its latest timestamp image, one count image of negative events and its latest timestamp image, and one grayscale image;
performing channel-wise stacking of the five frames according to time alignment to obtain a composite image containing five channels;
dividing each channel of the composite image proportionally into 16 sub-regions in a 4 × 4 grid, then performing region average pooling on each sub-region and replacing all pixels in each pooled region with a single value, to obtain a region feature matrix of size 4 × 4;
and stacking the region feature matrices to obtain a tensor of size 4 × 4 × 5, and inputting the tensor into the weight-adaptive extraction network for fusion.
2. The method of claim 1, wherein the asynchronous event stream and the synchronous grayscale images are acquired through an event camera together with scene construction and motion control.
3. The optical flow estimation method based on the fusion of asynchronous event streams and grayscale images as claimed in claim 1, wherein preprocessing the asynchronous event stream and the synchronous grayscale images to obtain event frames and grayscale images specifically comprises:
dividing the discrete events into two topics according to the left and right binocular cameras, corresponding to the events and grayscale images respectively;
packaging all events into Rosbag-format file packages;
storing the grayscale images in jpg format and labelling them in time order to obtain a grayscale image sequence;
and receiving the Rosbag-format file packages and the grayscale image sequence, and accumulating the asynchronous event stream in the Rosbag-format file packages into synchronous frames of the same size as the grayscale image, denoted as event frames.
4. The optical flow estimation method based on the fusion of asynchronous event streams and grayscale images as claimed in claim 3, wherein the event frames are represented as event count images or latest timestamp images.
5. The optical flow estimation method based on the fusion of asynchronous event streams and grayscale images as claimed in claim 1, wherein the weight-adaptive extraction network is implemented with two fully connected layers, the activation function of the first layer is a ReLU, the activation function of the second layer is a Sigmoid, and the output of the weight-adaptive extraction network is a 4 × 4 × 5 tensor of the same size as the input.
6. The optical flow estimation method based on the fusion of asynchronous event streams and grayscale images as claimed in claim 1, wherein the training of the optical flow estimation deep neural network comprises:
acquiring a fused image of the event frames and the grayscale image;
performing convolutional encoding and upsampling decoding on the fused image;
calculating the self-supervised loss, where the loss function comprises a smoothness error and a photometric error;
and updating the network connection parameters.
7. The method of claim 6, wherein the deep neural network employs four encoding layers, two residual layers, and four decoding layers.
8. An optical flow estimation system based on the fusion of asynchronous event streams and grayscale images, comprising:
a system bottom-layer module, used for acquiring an asynchronous event stream and synchronous grayscale images;
an event frame generation module, used for preprocessing the asynchronous event stream and the synchronous grayscale images to obtain event frames and grayscale images;
an event frame and grayscale image fusion module, used for performing channel-wise stacking of the event frames and the grayscale images according to time alignment to obtain a multi-channel composite image, pooling the composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into a weight-adaptive extraction network for fusion;
an optical flow estimation deep neural network module, used for inputting the fused image obtained after fusion into a trained optical flow estimation deep neural network to obtain the final optical flow estimation result;
wherein the multiple channels are five channels;
and performing channel-wise stacking of the event frames and the grayscale images according to time alignment to obtain a five-channel composite image, pooling the composite image to obtain region feature matrices, stacking the region feature matrices into a corresponding tensor, and inputting the tensor into the adaptive extraction network for fusion specifically comprises:
receiving five frames of event frames and a grayscale image of the same size, the five frames comprising one count image of positive events and its latest timestamp image, one count image of negative events and its latest timestamp image, and one grayscale image;
performing channel-wise stacking of the five frames according to time alignment to obtain a composite image containing five channels;
dividing each channel of the composite image proportionally into 16 sub-regions in a 4 × 4 grid, then performing region average pooling on each sub-region and replacing all pixels in each pooled region with a single value, to obtain a region feature matrix of size 4 × 4;
and stacking the region feature matrices to obtain a tensor of size 4 × 4 × 5, and inputting the tensor into the weight-adaptive extraction network for fusion.
CN202110436248.2A 2021-04-22 2021-04-22 Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image Active CN113269699B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110436248.2A CN113269699B (en) 2021-04-22 2021-04-22 Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110436248.2A CN113269699B (en) 2021-04-22 2021-04-22 Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image

Publications (2)

Publication Number Publication Date
CN113269699A CN113269699A (en) 2021-08-17
CN113269699B true CN113269699B (en) 2023-01-03

Family

ID=77229092

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110436248.2A Active CN113269699B (en) 2021-04-22 2021-04-22 Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image

Country Status (1)

Country Link
CN (1) CN113269699B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114723009B (en) * 2022-04-12 2023-04-25 重庆大学 Data representation method and system based on asynchronous event stream

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111798370A (en) * 2020-06-30 2020-10-20 武汉大学 Manifold constraint-based event camera image reconstruction method and system
CN111798395A (en) * 2020-06-30 2020-10-20 武汉大学 Event camera image reconstruction method and system based on TV constraint
WO2021035807A1 (en) * 2019-08-23 2021-03-04 深圳大学 Target tracking method and device fusing optical flow information and siamese framework
CN112529944A (en) * 2020-12-05 2021-03-19 东南大学 End-to-end unsupervised optical flow estimation method based on event camera
CN112686928A (en) * 2021-01-07 2021-04-20 大连理工大学 Moving target visual tracking method based on multi-source information fusion

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10733428B2 (en) * 2017-02-01 2020-08-04 The Government Of The United States Of America, As Represented By The Secretary Of The Navy Recognition actions on event based cameras with motion event features
US11288818B2 (en) * 2019-02-19 2022-03-29 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN110111366B (en) * 2019-05-06 2021-04-30 北京理工大学 End-to-end optical flow estimation method based on multistage loss
CN111667442B (en) * 2020-05-21 2022-04-01 武汉大学 High-quality high-frame-rate image reconstruction method based on event camera

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021035807A1 (en) * 2019-08-23 2021-03-04 深圳大学 Target tracking method and device fusing optical flow information and siamese framework
CN111798370A (en) * 2020-06-30 2020-10-20 武汉大学 Manifold constraint-based event camera image reconstruction method and system
CN111798395A (en) * 2020-06-30 2020-10-20 武汉大学 Event camera image reconstruction method and system based on TV constraint
CN112529944A (en) * 2020-12-05 2021-03-19 东南大学 End-to-end unsupervised optical flow estimation method based on event camera
CN112686928A (en) * 2021-01-07 2021-04-20 大连理工大学 Moving target visual tracking method based on multi-source information fusion

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
EV-FlowNet: Self-Supervised Optical Flow Estimation for Event-based Cameras; Alex Zihao Zhu et al.; arXiv; 2018-08-13; full text *
Research on a depth estimation neural network method with dual-path fusion (双路融合的深度估计神经网络方法研究); 刘春 et al.; Computer Engineering and Applications (计算机工程与应用); 2019-10-10 (No. 20); full text *
Research on target tracking algorithms based on dynamic vision sensors (基于动态视觉传感器的目标跟踪算法研究); 赵园良; China Masters' Theses Full-text Database (中国优秀博硕士学位论文全文数据库(硕士)); 2021-01-15; full text *
Light-field image depth estimation based on a multi-stream epipolar convolutional neural network (基于多流对极卷积神经网络的光场图像深度估计); 王硕 et al.; Computer Applications and Software (计算机应用与软件); 2020-08-12 (No. 08); full text *

Also Published As

Publication number Publication date
CN113269699A (en) 2021-08-17

Similar Documents

Publication Publication Date Title
CN109377530B (en) Binocular depth estimation method based on depth neural network
WO2021179820A1 (en) Image processing method and apparatus, storage medium and electronic device
KR101947782B1 (en) Apparatus and method for depth estimation based on thermal image, and neural network learning method
CN112529944B (en) End-to-end unsupervised optical flow estimation method based on event camera
CN112580545B (en) Crowd counting method and system based on multi-scale self-adaptive context network
CN113269699B (en) Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image
CN113034413A (en) Low-illumination image enhancement method based on multi-scale fusion residual error codec
CN115294282A (en) Monocular depth estimation system and method for enhancing feature fusion in three-dimensional scene reconstruction
CN115035171A (en) Self-supervision monocular depth estimation method based on self-attention-guidance feature fusion
CN114119694A (en) Improved U-Net based self-supervision monocular depth estimation algorithm
CN113660486A (en) Image coding, decoding, reconstructing and analyzing method, system and electronic equipment
TWI826160B (en) Image encoding and decoding method and apparatus
CN116071412A (en) Unsupervised monocular depth estimation method integrating full-scale and adjacent frame characteristic information
CN113191301B (en) Video dense crowd counting method and system integrating time sequence and spatial information
CN112801912B (en) Face image restoration method, system, device and storage medium
CN116091337A (en) Image enhancement method and device based on event signal nerve coding mode
CN115564664A (en) Motion blur removing method of two-stage transform coder/decoder based on fusion bilateral recombination attention
CN112164078B (en) RGB-D multi-scale semantic segmentation method based on encoder-decoder
CN111950496B (en) Mask person identity recognition method
CN115131414A (en) Unmanned aerial vehicle image alignment method based on deep learning, electronic equipment and storage medium
CN113762032A (en) Image processing method, image processing device, electronic equipment and storage medium
CN115456903B (en) Deep learning-based full-color night vision enhancement method and system
CN114140363B (en) Video deblurring method and device and video deblurring model training method and device
EP4174757B1 (en) Method and system for embedding information in a video signal
KR20220020560A (en) Method and Apparatus for Frame Rate Up Conversion Using Convolutional Neural Networks with Recurrent Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant