WO2020147574A1

WO2020147574A1 - Deep-learning-based stereo matching method for binocular dynamic vision sensor

Info

Publication number: WO2020147574A1
Application number: PCT/CN2019/130224
Authority: WO
Inventors: 陈广; 刘佩根; 沈律宇; 宁翔宇; 唐笠轩
Original assignee: 同济大学
Priority date: 2019-01-17
Filing date: 2019-12-31
Publication date: 2020-07-23
Also published as: CN109801314A; CN109801314B

Abstract

A deep-learning-based stereo matching method for a binocular dynamic vision sensor, the method comprising the following steps: 1) generating training point pairs according to depth information in a data set of a binocular event camera; 2) constructing a representation mode suitable for events in an event flow of a dynamic vision sensor; and 3) representing the event training point pairs according to the representation mode, and transmitting same to twin neural networks for training, and performing stereo matching according to a training result. Compared with the prior art, the method has the advantages of high matching precision and a fast matching speed.

Description

A stereo matching method for binocular dynamic vision sensor based on deep learning

Technical field

The invention relates to the technical field of image matching, in particular to a stereo matching method for binocular dynamic vision sensors based on deep learning.

Background art

The dynamic vision sensor outputs a stream of events by detecting changes in the logarithmic intensity of image brightness, where each event has location, polarity, and time stamp information. Compared with traditional cameras, it has the advantages of low latency, high time resolution and large dynamic range.
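As an illustration of the event stream described above, the following minimal Python sketch stores each event with its location, timestamp, and polarity. The field names, the timestamp unit (seconds), the ±1 polarity encoding, and the helper `in_window` are assumptions made for this sketch; the text does not prescribe any particular data layout.

```python
from typing import List, NamedTuple


class Event(NamedTuple):
    """One event of a dynamic vision sensor (field names are hypothetical)."""
    x: int         # pixel column
    y: int         # pixel row
    t: float       # timestamp, assumed here to be in seconds
    polarity: int  # assumed encoding: +1 brightness increase, -1 decrease


def in_window(events: List[Event], t_start: float, t_end: float) -> List[Event]:
    """Return the events whose timestamp lies in the half-open range [t_start, t_end)."""
    return [e for e in events if t_start <= e.t < t_end]


# A tiny hand-made stream, for illustration only.
stream = [
    Event(10, 5, 0.001, +1),
    Event(11, 5, 0.004, -1),
    Event(40, 20, 0.009, +1),
]
recent = in_window(stream, 0.0, 0.005)
```

Because the sensor emits only such sparse records rather than full frames, later steps can operate on short time windows of the stream.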

In traditional image processing technology, binocular stereo matching technology is an important way to obtain image depth information. However, due to the large amount of output data of traditional vision sensors and high resource consumption, the application of traditional binocular stereo matching technology on the mobile terminal is greatly restricted.

Summary of the invention

The purpose of the present invention is to provide a stereo matching method for binocular dynamic vision sensor based on deep learning in order to overcome the defects of the prior art.

The object of the present invention can be achieved by the following technical solutions:

A stereo matching method for binocular dynamic vision sensor based on deep learning, including the following steps:

1) Generate training point pairs based on the depth information in the binocular event camera data set;

2) Construct a representation method suitable for events in the event stream of dynamic vision sensors;

3) Characterize the event training point pairs according to the representation method, send them to the twin (Siamese) neural network for training, and perform stereo matching with the trained network.

The step 1) specifically includes the following steps:

11) Randomly select an event as a point of interest within the field of view of the dynamic vision sensor on the left;

12) According to the position of the point of interest in the left sensor and the ground-truth depth, and using the epipolar line as the constraint, project the point onto the right dynamic vision sensor to obtain its position coordinates in the right sensor, forming a training point pair.

In step 12), the position coordinates $(x_R, y_R)$ of the point of interest in the right sensor are calculated as:

$$x_R = x_L - d, \qquad y_R = y_L, \qquad d = \frac{b f}{z}$$

where $(x_L, y_L)$ are the position coordinates of the point of interest in the left sensor, $d$ is the parallax (disparity) value, $z$ is the corresponding depth, and $b$ and $f$ are the baseline distance and focal length of the binocular dynamic vision sensor.

In step 2), the event representation is constructed as follows:

21) Taking the point to be characterized as the geometric center, establish a square area with side length L, aligned with the sensor's viewing angle, and divide it into N*N equal small square areas, where N is an odd number;

22) Select S consecutive time intervals Δt (S is an even number) such that the event timestamp of the characterized point is located at

$$\frac{S}{2}\,\Delta t$$

from the start of the selected window, that is, at its midpoint, and count the number of events $c_i$ generated in each small square area in each time interval Δt;

23) Normalize the number of events in each small square in each time interval Δt and use the result as the value of that small square:

$$m_i = \frac{c_i}{c_{max}}, \qquad c_{max} = \max_i(c_i)$$

where $m_i$ is the normalized value and $c_{max}$ is the maximum number of events counted in any small square over the different time intervals Δt;

24) Sort the normalized values $m_i$ from small to large to form an N*N*S-dimensional representation vector.

In step 3), training the twin neural network on the event training point pairs specifically includes the following steps:

31) Send the representation vectors of a matched training point pair to the twin neural network, and output their respective M-dimensional description vectors;

32) Calculate the Euclidean distance between the generated M-dimensional description vectors, and adjust the parameters of the twin neural network to reduce the distance value;

33) Send the representation vectors of the two unmatched event points to the twin neural network after adjusting the parameters, and output their respective M-dimensional description vectors;

34) Calculate the Euclidean distance between the M-dimensional description vectors generated by the two unmatched event points, adjust the neural network parameters, and expand the distance value;

35) Perform stereo matching.

In step 3), the representations of matched and unmatched event point pairs are sent to the twin neural network in equal numbers.
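The training objective in steps 31) to 34) can be illustrated with a short sketch. The text only states that the Euclidean distance is reduced for matched pairs and expanded for unmatched pairs; the margin-based contrastive loss used here is a common substitute chosen for this sketch, not a formula given in the source. The function names, the `margin` parameter, and the batch layout are hypothetical, and the network itself (which would produce the description vectors) is omitted.

```python
import math
import random


def euclidean(u, v):
    """Euclidean distance between two description vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def contrastive_loss(desc_a, desc_b, matched, margin=1.0):
    """Assumed loss: pull matched pairs together, push unmatched pairs
    apart until their distance reaches `margin`."""
    d = euclidean(desc_a, desc_b)
    if matched:
        return d * d
    return max(0.0, margin - d) ** 2


def balanced_batch(matched_pairs, unmatched_pairs, batch_size, rng):
    """Draw equally many matched and unmatched pairs, then shuffle them."""
    half = batch_size // 2
    batch = [(a, b, True) for a, b in rng.sample(matched_pairs, half)]
    batch += [(a, b, False) for a, b in rng.sample(unmatched_pairs, half)]
    rng.shuffle(batch)
    return batch


# Toy description vectors (M = 2), for illustration only.
matched_pairs = [([0.0, 0.0], [0.1, 0.0]) for _ in range(4)]
unmatched_pairs = [([0.0, 0.0], [0.3, 0.4]) for _ in range(4)]
batch = balanced_batch(matched_pairs, unmatched_pairs, 4, random.Random(0))
mean_loss = sum(contrastive_loss(a, b, m) for a, b, m in batch) / len(batch)
```

In an actual training loop, the gradient of such a loss with respect to the shared network parameters would drive the parameter adjustment described in steps 32) and 34).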

Compared with the prior art, the present invention has the following advantages:

1. The present invention effectively solves the stereo matching problem for dynamic vision sensors. It processes the generated event stream directly, which reduces the amount of calculation and the required computing resources, improves the matching speed, and makes the method easy to implement on mobile terminals.

2. The present invention uses the event distribution around the point of interest to characterize that point, so the information used is rich and stable. A large amount of data is used to train the neural network, and stereo matching is performed based on deep learning, which makes the matching method more robust and improves the matching accuracy.

Brief description of the drawings

Figure 1 is a flow chart of the stereo matching of the present invention.

Figure 2 is a schematic plan view of the characterization method.

Figure 3 is a partial representation diagram.

Figure 4 is a schematic diagram of the twin neural network.

Detailed description

The present invention will be described in detail below with reference to the drawings and specific embodiments.

Examples

The present invention provides a stereo matching method for binocular dynamic vision sensors based on deep learning. The method characterizes the event streams output by the left and right dynamic vision sensors and performs matching through a trained neural network, improving both the matching accuracy and the matching speed. The method includes the following steps:

(1) Generate training point pairs based on the data set of the existing binocular event camera and the depth information provided by it;

(2) Construct a representation method suitable for event stream events of dynamic vision sensors;

(3) Use the constructed characterization method to characterize the event training point pair and send it to the neural network for training.

In step (1), the method of generating event training point pairs is as follows:

(2-1) Randomly select an event as a point of interest within the field of view of the dynamic vision sensor on the left.

(2-2) Taking the upper-left corner of the sensor as the origin, and taking the rightward and downward directions as the positive x and y semi-axes respectively, record the position $(x_L, y_L)$ of the point of interest. According to the projection principle of the binocular camera, the coordinates $(x_R, y_R)$ of the corresponding point in the right sensor should satisfy:

$$x_R = x_L - d, \qquad y_R = y_L$$

where $d$ is the parallax (disparity) value, calculated as:

$$d = \frac{b f}{z}$$

where $z$ is the depth corresponding to the event point, and $b$ and $f$ are the baseline distance and focal length of the binocular dynamic vision sensor, which are known quantities.
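The generation of one training point pair can be sketched as follows. The sketch assumes rectified sensors, so that the corresponding point lies on the same row with $x_R = x_L - d$ and $d = b f / z$; it also assumes the focal length is expressed in pixels and the depth and baseline in the same length unit. The function names and the numeric values are hypothetical.

```python
def disparity(z, baseline, focal_px):
    """Parallax d = b * f / z for a point at depth z (z must be positive)."""
    if z <= 0:
        raise ValueError("depth must be positive")
    return baseline * focal_px / z


def project_to_right(x_left, y_left, z, baseline, focal_px):
    """Project a left-sensor point onto the right sensor.

    Assumes rectified sensors: the row is unchanged and the column
    shifts left by the disparity.
    """
    d = disparity(z, baseline, focal_px)
    return x_left - d, y_left


# Illustrative values only: 0.1 m baseline, 200 px focal length, 2 m depth.
left_point = (120, 64)
right_point = project_to_right(*left_point, z=2.0, baseline=0.1, focal_px=200.0)
training_pair = (left_point, right_point)
```

With these example numbers the disparity is 10 pixels, so the left point at column 120 pairs with column 110 on the same row of the right sensor.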

In step (2), the method of constructing event representation is as follows:

(3-1) Taking the characterizing point as the geometric center, establish a square with side length L and aligned with the sensor's viewing angle, and divide this square into equal N*N square small areas, as shown in Figure 2. In this embodiment, the side length L is 33 pixels, and N is 11, that is, there are 121 small squares, and the side length of each small square is 3 pixels.

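(3-2) Take S consecutive time intervals Δt such that the timestamp of the selected event is located at

$$\frac{S}{2}\,\Delta t$$

from the start of the selected window, that is, at its midpoint. Count the number of events $c_i$ generated in each small square area in each time interval Δt, as shown in Figure 3.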

(3-3) Normalize the number of events in each small square in each time interval Δt and use the result as the value of that small square. The normalization formula is:

$$m_i = \frac{c_i}{c_{max}}, \qquad c_{max} = \max_i(c_i)$$

where $m_i$ is the normalized value and $c_{max}$ is the maximum number of events counted in any small square over the different time intervals Δt.

(3-4) Sort $m_i$ from small to large to form an N*N*S-dimensional representation vector.
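Steps (3-1) to (3-4) can be sketched in plain Python as below. The side length 33 and N = 11 come from this embodiment; the values S = 4 and Δt = 5 ms, the function name, and the event tuple layout `(x, y, t)` are assumptions of the sketch. The sketch also assumes the window is centered on the event timestamp and reads "from small to large" as ordering the values by their index i (interval, then row, then column); sorting by value would be the literal alternative reading.

```python
def event_representation(events, cx, cy, t0, side=33, n=11, s=4, dt=0.005):
    """Build the n*n*s normalized event-count vector around (cx, cy, t0).

    events: iterable of (x, y, t) tuples with integer pixel coordinates.
    """
    cell = side // n                # side length of one small square (3 px here)
    half = side // 2                # distance from the center to the patch edge
    t_start = t0 - (s // 2) * dt    # assumed: window centered on t0
    counts = [0] * (s * n * n)
    for x, y, t in events:
        col = (x - (cx - half)) // cell
        row = (y - (cy - half)) // cell
        k = int((t - t_start) // dt)  # index of the time interval
        if 0 <= col < n and 0 <= row < n and 0 <= k < s:
            counts[(k * n + row) * n + col] += 1
    c_max = max(counts)
    if c_max == 0:                  # no events in the window: all-zero vector
        return [0.0] * len(counts)
    return [c / c_max for c in counts]


# Two events at the center, one in the top-left cell, one outside the patch.
events = [(50, 50, 0.0101), (50, 50, 0.0102), (34, 34, 0.0001), (100, 100, 0.01)]
vec = event_representation(events, cx=50, cy=50, t0=0.010)
```

With these defaults the vector has 11 * 11 * 4 = 484 entries, and the busiest cell is normalized to 1.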

In step (3), the training method for the representation is as follows:

(4-1) Using the method described in step (1), take multiple different time points in the existing binocular event camera data set and generate multiple event point pairs at different locations at each time point. Each event point is characterized separately to obtain an N*N*S-dimensional representation vector, which is sent to the twin neural network to output an M-dimensional description vector. In this embodiment, the neural network is as shown in Figure 4.

(4-2) Calculate the Euclidean distance between the M-dimensional description vectors generated by the corresponding point pair, and adjust the neural network parameters to reduce the distance value.

(4-3) Similarly, the representations of the two unmatched event points are sent to the aforementioned neural network, and their respective M-dimensional description vectors are output.

(4-4) Calculate the Euclidean distance between the two description vectors of the mismatched point pair, and adjust the neural network parameters to expand the distance value. During training, the representations of matched and mismatched event point pairs are sent to the twin neural network in equal numbers.

(4-5) Perform stereo matching.

For each new event generated by the left dynamic vision sensor, a representation is established and sent to the trained neural network to generate a description vector. At the same time, all positions on the same epipolar line in the right sensor are characterized in turn and sent to the neural network to generate description vectors. The Euclidean distances between the left description vector and each right description vector are calculated and compared, and the position corresponding to the right description vector with the smallest distance is taken as the matching point.
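The search along the epipolar line can be sketched as a nearest-neighbor lookup over description vectors. The sketch assumes the description vectors have already been produced by the trained network and that the candidates are keyed by their x-coordinate on the right sensor; the function name and the toy vectors are hypothetical.

```python
import math


def match_along_epipolar_line(left_desc, right_descs):
    """Return (best_x, best_distance) for the candidate whose description
    vector is closest, in Euclidean distance, to the left description vector.

    right_descs: dict mapping x-coordinate -> description vector for the
    candidate positions on the same epipolar line of the right sensor.
    """
    best_x, best_dist = None, math.inf
    for x, desc in right_descs.items():
        dist = math.dist(left_desc, desc)
        if dist < best_dist:
            best_x, best_dist = x, dist
    return best_x, best_dist


# Toy description vectors (M = 2), for illustration only.
left_desc = [0.0, 1.0]
candidates = {10: [1.0, 1.0], 11: [0.0, 0.9], 12: [0.5, 0.5]}
best_x, best_dist = match_along_epipolar_line(left_desc, candidates)
```

The difference between the left event's x-coordinate and `best_x` then gives the disparity of the matched pair.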

The above description of the embodiments is intended to help those of ordinary skill in the art understand and apply the present invention. Those skilled in the art can readily make various modifications to these embodiments and apply the general principles described herein to other embodiments without creative work. Therefore, the present invention is not limited to the embodiments herein. Improvements and modifications made by those skilled in the art on the basis of this disclosure, without departing from the scope of the present invention, shall fall within the protection scope of the present invention.

Claims

1. A stereo matching method for binocular dynamic vision sensor based on deep learning, characterized in that it comprises the following steps:

1) Generate training point pairs based on the depth information in the binocular event camera data set;

2) Construct a representation method suitable for events in the event stream of dynamic vision sensors;

3) According to the representation method, the event training point pairs are characterized and sent to the twin neural network for training, and stereo matching is performed according to the training result.

2. The stereo matching method of binocular dynamic vision sensor based on deep learning according to claim 1, wherein said step 1) specifically includes the following steps:

11) Randomly select an event as a point of interest within the field of view of the dynamic vision sensor on the left;

12) According to the position of the point of interest in the left sensor and the ground-truth depth, and using the epipolar line as the constraint, project the point onto the right dynamic vision sensor to obtain its position coordinates in the right sensor, forming a training point pair.

3. The stereo matching method of binocular dynamic vision sensor based on deep learning according to claim 2, characterized in that, in said step 12), the position coordinates $(x_R, y_R)$ of the point of interest in the right sensor are calculated as:

$$x_R = x_L - d, \qquad y_R = y_L, \qquad d = \frac{b f}{z}$$

where $(x_L, y_L)$ are the position coordinates of the point of interest in the left sensor, $d$ is the parallax (disparity) value, $z$ is the corresponding depth, and $b$ and $f$ are the baseline distance and focal length of the binocular dynamic vision sensor.

4. The stereo matching method of binocular dynamic vision sensor based on deep learning according to claim 1, wherein in said step 2), the event representation is constructed as follows:

21) Taking the characterization point as the geometric center, establish a square area with side length L and aligned with the sensor's viewing angle, and divide this square area into equal N*N square small areas;

22) Select S consecutive time intervals Δt such that the event timestamp of the characterized point is located at

$$\frac{S}{2}\,\Delta t$$

from the start of the selected window, that is, at its midpoint, and count the number of events $c_i$ generated in each small square area in each time interval Δt;

23) Normalize the number of events in each small square in each time interval Δt and use the result as the value of that small square:

$$m_i = \frac{c_i}{c_{max}}, \qquad c_{max} = \max_i(c_i)$$

where $m_i$ is the normalized value and $c_{max}$ is the maximum number of events counted in any small square over the different time intervals Δt;

24) Sort the normalized values $m_i$ from small to large to form an N*N*S-dimensional representation vector.

5. The stereo matching method of binocular dynamic vision sensor based on deep learning according to claim 1, characterized in that, in said step 3), training the twin neural network on the event training point pairs specifically includes the following steps:

31) Send the representation vector of the matched training point pair to the twin neural network, and output its respective M-dimensional description vector;

32) Calculate the Euclidean distance between the generated M-dimensional description vectors, and adjust the parameters of the twin neural network to reduce the distance value;

33) Send the representation vectors of the two unmatched event points to the twin neural network after adjusting the parameters, and output their respective M-dimensional description vectors;

34) Calculate the Euclidean distance between the M-dimensional description vectors generated by the two unmatched event points, adjust the neural network parameters, and expand the distance value;

35) Perform stereo matching.

6. The stereo matching method of binocular dynamic vision sensor based on deep learning according to claim 5, wherein in said step 3), the representations of matched and unmatched event point pairs are sent to the twin neural network in equal numbers.