CN114723009B - Data representation method and system based on asynchronous event stream - Google Patents

Data representation method and system based on asynchronous event stream

Info

Publication number
CN114723009B
Authority
CN
China
Prior art keywords
event
doe
moe
time
polarity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210379153.6A
Other languages
Chinese (zh)
Other versions
CN114723009A (en)
Inventor
古富强
郭方明
陈超
黄柳金
刘淑文
张栋宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chongqing University
Original Assignee
Chongqing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chongqing University
Priority to CN202210379153.6A
Publication of CN114723009A
Application granted
Publication of CN114723009B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention relates to a data representation method and system based on asynchronous event streams, belonging to the technical field of deep learning feature engineering. The method obtains a richer representation of asynchronous event data by concatenating magnitude-of-events (MOE) information and density-of-events (DOE) information, and comprises the following steps: S1, inputting an asynchronous event stream as event data; S2, acquiring the magnitude of events (MOE); S3, acquiring the density of events (DOE); S4, concatenating the MOE and the DOE to obtain the MDOE representation, which simultaneously contains the magnitude information and the density information of the event data. Compared with existing event-based representation methods, the proposed MDOE contains richer information about the polarity, time, and density of events, and has two advantages: 1) it is a conceptually simple, generic representation that is independent of the downstream task; 2) it achieves superior performance relative to existing representations on various event-based datasets.

Description

Data representation method and system based on asynchronous event stream
Technical Field
The invention belongs to the technical field of deep learning feature engineering, and relates to a data representation method and system based on asynchronous event streams.
Background
Asynchronous event streams are the output of event-based sensors (DVS cameras), which offer higher dynamic range, higher temporal resolution, lower latency, and better energy efficiency than conventional devices (RGB cameras), and are widely used in fields such as autonomous driving, safety monitoring, and industrial automation. However, due to the nature of the asynchronous event stream itself, learning from these data presents many challenges; in particular, such data cannot be used directly to train convolutional neural networks.
Learning a useful representation of asynchronous event streams that can be exploited by convolutional neural networks is one approach to this challenge. Currently, asynchronous event streams are mainly represented by hand-crafted event representations, spike-based event representations, and learning-based event representations. A naive hand-crafted representation uses the number of events that occur at each pixel, but this discards the time and polarity information of the events, which may degrade the performance of downstream applications. There are many improvements in this respect, such as taking polarity information into account or introducing the concept of time surfaces to describe the recent history of events in a spatial neighborhood; however, time surfaces are relatively expensive to compute, are susceptible to noise events, and consider only the timestamp of the most recent event in the neighborhood. Spike-based representations mainly rely on spiking neural networks, which have been used for many event-based tasks, but their usability in real-world scenarios is limited because spiking neurons are not differentiable. Workarounds exist, such as training a conventional neural network on frame-based data and converting the learned weights to a spiking neural network, or approximating the spike derivative with a surrogate function, but the accuracy achieved by spike-based approaches remains lower than that of deep learning approaches to date. There are also many learning-based methods for handling asynchronous event data. For example, the grid-based representation called EST can learn end-to-end features directly from asynchronous event data through differentiable kernel convolutions and quantization, but it needs to be adapted to the downstream task. The Matrix-LSTM approach uses a grid of LSTM cells to learn an end-to-end event-based representation that captures local temporal features while preserving spatial structure. Phased LSTM is an LSTM variant that uses time gates to learn the precise timing of incoming events and extracts relevant features through a word-embedding layer, but it does not capture generic features well because of the limited representation capability of the embedding layer. In short, current representation methods discard one or more types of information among event polarity, density, and time, which may degrade downstream performance.
Therefore, it is necessary to focus on how to represent event stream data efficiently in order to broaden the application of learning models to event data.
Disclosure of Invention
In view of the above, the present invention is directed to a data representation method and system based on asynchronous event streams, which obtain a richer representation of asynchronous event data by concatenating magnitude-of-events (MOE) information and density-of-events (DOE) information, thereby solving the problem that existing asynchronous-event-stream representation methods lose one or more types of information among event polarity, density, and time, and avoiding the resulting performance degradation of downstream programs.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a method of data representation based on an asynchronous event stream, the method comprising the steps of: s1, inputting an asynchronous event stream as event data; s2, acquiring event size MOE; s3, acquiring an event density DOE; s4, splicing the MOE and the DOE to obtain MDOE expression, wherein the obtained MDOE simultaneously comprises size information of event data and density information of the event.
Further, in step S2, the MOE is acquired by sampling: a fixed time interval is set and events are sampled at each interval, yielding a representation with dimensions 2×C×H×W, where H and W are the spatial height and width of the event image data, respectively, and C is the number of time bins. The procedure is similar to the neural dynamics of a spiking neuron, except that in a spiking neuron a spike is fired and the membrane potential is reset when the potential exceeds the neuron's threshold, whereas the MOE acquisition has no firing process, so the membrane potential is never reset and all polarity information is preserved.
Further, in step S3, the event count over the whole event period is not used directly when acquiring the event density DOE; instead, the event count within each time window is used to generate a vector of size C for each pixel, and doing this for each polarity yields a DOE representation of size 2×C×H×W.
Further, in step S1, let ε be the asynchronous event input sequence, expressed as:

ε = {e_i}_{i=1}^{N}, e_i = (x_i, t_i, p_i) (1)

where x_i is the position (for a DVS camera, x_i = (x_i, y_i) gives the coordinates of the pixel that triggered the event), t_i is the timestamp at which the event was generated, and p_i is the polarity of the event, which takes two values, 1 and −1, representing positive and negative events respectively; i is the event index. The dynamics of a spiking neuron are described as follows:

τ · du(t)/dt = −u(t) + I(t) (2)

where u(t) represents the membrane potential of the neuron at time t, I(t) = Σ_j w_j x_j(t) is a weighted sum of the presynaptic inputs, and τ is a time constant. Over a period of time, the denser the events, the larger the membrane potential.
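For illustration only (not part of the claimed method), the following minimal Python/NumPy sketch contrasts the dynamics of formula (2) with the MOE accumulation of step S2: with reset=True the function mimics a spiking (LIF) neuron as in fig. 1(a), and with reset=False it accumulates without reset as in fig. 1(b). The function name membrane_trace, the Euler discretization, and the toy event times are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def membrane_trace(spike_times, t_grid, tau=0.05, w=1.0,
                   threshold=1.0, reset=True):
    """Euler discretization of formula (2): tau * du/dt = -u(t) + I(t).

    reset=True mimics a spiking (LIF) neuron, which fires and resets once
    u crosses the threshold; reset=False behaves like the MOE accumulation
    of step S2, which never resets and so keeps the full event magnitude.
    """
    dt = float(t_grid[1] - t_grid[0])
    u, trace = 0.0, []
    for t in t_grid:
        # I(t): weighted count of input events arriving during this step
        I = w * np.count_nonzero((spike_times >= t) & (spike_times < t + dt))
        u += (dt / tau) * (-u + I)
        if reset and u >= threshold:
            u = 0.0  # spike fired, membrane potential reset
        trace.append(u)
    return np.array(trace)

# toy event sequence: a dense burst followed by sparse events
events = np.array([0.010, 0.012, 0.014, 0.016, 0.100, 0.250])
grid = np.linspace(0.0, 0.3, 300)
lif_u = membrane_trace(events, grid, reset=True)   # fig. 1(a)-like behaviour
moe_u = membrane_trace(events, grid, reset=False)  # fig. 1(b)-like behaviour
```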
Further, in steps S2 and S3, the MOE and the DOE are obtained by sampling the asynchronous event sequence. Let E denote the MOE and V denote the DOE; they are described as follows:

E±(x_l, y_m, c_n) = Σ_{e_i∈ε±} 1_{e_i}(t_1, t_n) · k(x_l − x_i) · k(y_m − y_i) (3)

V±(x_l, y_m, c_n) = Σ_{e_i∈ε±} 1_{e_i}(t_n − ΔT, t_n) · k(x_l − x_i) · k(y_m − y_i) (4)

k(a) = max(0, 1 − |a|) (5)

1_{e_i}(t_a, t_b) = 1 if t_a ≤ t_i < t_b, and 0 otherwise (6)

ΔT = (t_N − t_1)/C (7)

t_n = t_1 + (c_n + 1)ΔT (8)

where ε+ and ε− are the sub-sequences of events with positive and negative polarity, respectively; k(a) is a bilinear kernel that mimics neuronal dynamics; 1_{e_i} is an indicator function ensuring that only events within the specified time period are counted; (x_l, y_m, c_n) are the space-time coordinates on the voxel grid, with x_l ∈ {0, 1, ···, W−1}, y_m ∈ {0, 1, ···, H−1}, and c_n ∈ {0, 1, ···, C−1}; and ΔT is the size of a time bin. To reduce the effect of sensor noise (e.g., different numbers of events reporting the same scene at different times), the resulting MOE (E±) and DOE (V±) are normalized by dividing them by their respective maximum values over the pixels and time bins.
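As an illustrative sketch of how formulas (3)–(8) might be computed in practice, the following Python/NumPy function builds the E and V grids and applies the normalization described above. It assumes integer pixel coordinates, so the bilinear kernel reduces to an exact pixel match, and it reads the time windows directly from the indicator functions in (3) and (4); the name mdoe_grids and all parameter choices are assumptions, not part of the patent.

```python
import numpy as np

def mdoe_grids(x, y, t, p, H, W, C):
    """Sketch of formulas (3)-(8): build the MOE grid E and the DOE grid V,
    each of shape (2, C, H, W), from an event stream (x_i, y_i, t_i, p_i).
    Integer pixel coordinates are assumed, so the bilinear kernel
    k(a) = max(0, 1 - |a|) reduces to an exact pixel match."""
    dT = (t[-1] - t[0]) / C                       # formula (7): time-bin size
    E = np.zeros((2, C, H, W), dtype=np.float32)  # MOE, one channel per polarity
    V = np.zeros((2, C, H, W), dtype=np.float32)  # DOE, one channel per polarity
    for pol_idx, pol in enumerate((1, -1)):       # positive, then negative events
        sel = p == pol
        xs, ys, ts = x[sel], y[sel], t[sel]
        for c in range(C):
            t_n = t[0] + (c + 1) * dT             # formula (8): right edge of bin c
            since_start = ts < t_n                # indicator (6) as used by the MOE
            in_window = (ts >= t_n - dT) & (ts < t_n)  # indicator (6) for the DOE
            np.add.at(E[pol_idx, c], (ys[since_start], xs[since_start]), 1.0)
            np.add.at(V[pol_idx, c], (ys[in_window], xs[in_window]), 1.0)
    # normalization described in the text: divide each polarity channel by its
    # maximum over the pixels and time bins to reduce sensor-noise effects
    E /= np.maximum(E.reshape(2, -1).max(axis=1), 1e-9)[:, None, None, None]
    V /= np.maximum(V.reshape(2, -1).max(axis=1), 1e-9)[:, None, None, None]
    return E, V
```

The MOE accumulates every event from the start of the stream up to the end of bin c (no reset), while the DOE counts only the events inside each bin, matching the descriptions of steps S2 and S3.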
Further, in step S4, the MDOE, denoted S, is the concatenation of the MOE and the DOE along the polarity dimension, i.e.

S(p, x_l, y_m, c_n) = Cat±(E±(x_l, y_m, c_n), V±(x_l, y_m, c_n)) (9)

where Cat denotes the concatenation operation and p indexes the polarity channels of the MOE and DOE, p ∈ {0, 1, 2, 3}: p = 0 and p = 1 denote the ON and OFF polarities of the MOE, and p = 2 and p = 3 correspond to the ON and OFF polarities of the DOE.
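Continuing the sketch above, a minimal illustration of the concatenation in formula (9); the random events, resolution, and shapes are placeholders rather than values from the patent.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, C, n = 180, 240, 9, 10_000           # toy resolution, 9 time bins
x = rng.integers(0, W, n)
y = rng.integers(0, H, n)
t = np.sort(rng.uniform(0.0, 0.3, n))      # timestamps in seconds
p = rng.choice([1, -1], n)                 # ON/OFF polarities

E, V = mdoe_grids(x, y, t, p, H, W, C)     # from the sketch above

# formula (9): splice MOE and DOE along the polarity axis, giving p in {0,1,2,3}
# (0/1 = ON/OFF channels of the MOE, 2/3 = ON/OFF channels of the DOE)
S = np.concatenate([E, V], axis=0)         # MDOE, shape (4, C, H, W)
net_input = S.reshape(4 * C, H, W)         # 4*C-channel image for a CNN classifier
```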
The invention has the beneficial effects that:
compared with the existing event-based representation method, the MDOE method provided by the invention contains more abundant information about the polarity, time and density of the event, and has two advantages: 1) It is a conceptual simple generic representation, task independent. 2) The present method achieves superior performance relative to existing representations on various event-based datasets.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objects and other advantages of the invention may be realized and obtained by means of the instrumentalities and combinations particularly pointed out in the specification.
Drawings
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in detail below through preferred embodiments with reference to the accompanying drawings, in which:
FIG. 1 is a schematic diagram of membrane potential changes of impulse neurons and a schematic diagram of MOE membrane potential changes of the same sequence;
FIG. 2 is a pseudo code schematic diagram of the proposed method of the present invention;
FIG. 3 is a schematic diagram of the evaluation of the effect of time delay on N-Caltech101;
FIG. 4 is a schematic diagram of the effect of using training data of different proportions.
Detailed Description
The technical scheme of the invention is described in detail below with reference to the accompanying drawings.
The data representation method based on asynchronous event streams provided by the invention comprises the following steps: S1, inputting an asynchronous event stream as event data; S2, acquiring the magnitude of events (MOE); S3, acquiring the density of events (DOE); S4, concatenating the MOE and the DOE to obtain the MDOE representation, which simultaneously contains the magnitude information and the density information of the event data.
In step S2, the MOE is acquired by sampling: a fixed time interval is set and events are sampled at each interval, yielding a representation with dimensions 2×C×H×W, where H and W are the spatial height and width of the event image data, respectively, and C is the number of time bins. The procedure is similar to the neural dynamics of a spiking neuron, except that in a spiking neuron a spike is fired and the membrane potential is reset when the potential exceeds the neuron's threshold, whereas the MOE acquisition has no firing process, so the membrane potential is never reset and all polarity information is preserved, as shown in fig. 1: fig. 1(a) shows the dynamic change of the neuron membrane potential, and fig. 1(b) shows the MOE of the same event sequence.
In step S3, the event count over the whole event period is not used directly when acquiring the event density DOE; instead, the event count within each time window is used to generate a vector of size C for each pixel, and doing this for each polarity yields a DOE representation of size 2×C×H×W.
As shown in fig. 2, let ε be the asynchronous event input sequence, expressed as:

ε = {e_i}_{i=1}^{N}, e_i = (x_i, t_i, p_i) (1)

where x_i is the position (for a DVS camera, x_i = (x_i, y_i) gives the coordinates of the pixel that triggered the event), t_i is the timestamp at which the event was generated, and p_i is the polarity of the event, which takes two values, 1 and −1, representing positive and negative events respectively; i is the event index. The dynamics of a spiking neuron are described as follows:

τ · du(t)/dt = −u(t) + I(t) (2)

where u(t) represents the membrane potential of the neuron at time t, I(t) = Σ_j w_j x_j(t) is a weighted sum of the presynaptic inputs, and τ is a time constant. Over a period of time, the denser the events, the larger the membrane potential.
The MOE and the DOE are obtained separately by sampling the asynchronous event sequence. Let E denote the MOE and V denote the DOE; they are described as follows:

E±(x_l, y_m, c_n) = Σ_{e_i∈ε±} 1_{e_i}(t_1, t_n) · k(x_l − x_i) · k(y_m − y_i) (3)

V±(x_l, y_m, c_n) = Σ_{e_i∈ε±} 1_{e_i}(t_n − ΔT, t_n) · k(x_l − x_i) · k(y_m − y_i) (4)

k(a) = max(0, 1 − |a|) (5)

1_{e_i}(t_a, t_b) = 1 if t_a ≤ t_i < t_b, and 0 otherwise (6)

ΔT = (t_N − t_1)/C (7)

t_n = t_1 + (c_n + 1)ΔT (8)

where ε+ and ε− are the sub-sequences of events with positive and negative polarity, respectively; k(a) is a bilinear kernel that mimics neuronal dynamics; 1_{e_i} is an indicator function ensuring that only events within the specified time period are counted; (x_l, y_m, c_n) are the space-time coordinates on the voxel grid, with x_l ∈ {0, 1, ···, W−1}, y_m ∈ {0, 1, ···, H−1}, and c_n ∈ {0, 1, ···, C−1}; and ΔT is the size of a time bin. To reduce the effect of sensor noise (e.g., different numbers of events reporting the same scene at different times), the resulting MOE (E±) and DOE (V±) are normalized by dividing them by their respective maximum values over the pixels and time bins.
The MDOE, denoted S, is the concatenation of the MOE and the DOE along the polarity dimension, i.e.

S(p, x_l, y_m, c_n) = Cat±(E±(x_l, y_m, c_n), V±(x_l, y_m, c_n)) (9)

where Cat denotes the concatenation operation and p indexes the polarity channels of the MOE and DOE, p ∈ {0, 1, 2, 3}: p = 0 and p = 1 denote the ON and OFF polarities of the MOE, and p = 2 and p = 3 correspond to the ON and OFF polarities of the DOE.
The new asynchronous event data representation method provided by the invention contains not only the polarity and time information of the events but also the event density, and can better represent the information in asynchronous event stream data. In this embodiment, the proposed method is evaluated on three public event-based datasets (N-Caltech101, N-Cars, and EvTouch-Objects), and the following comparisons are carried out during the evaluation:
first, the results of the operation of the proposed method and the benchmark method of the present invention on two visual object recognition data sets (NCaltech 101 and N-Cars) are compared. Benchmark methods for comparison include HATS, matrix-LSTM, E2VID, event Frame), event Count Image, voxel Grid, and EST. All baseline methods use ResNet-34 as a classifier except E2VID (ResNet-18 is used). It can be seen from table 1 that the test accuracy of the present invention is much higher on both data sets than on all baseline methods.
Table 1. Comparison of the results of the present invention and the benchmark methods on the N-Caltech101 and N-Cars datasets
Method N-Caltech101 (variance) N-Cars (variance)
HATS 69.7 90.9
EST 81.7 92.5
Matrix-LSTM 83.5(1.2) 92.2(0.7)
Event Frame 80.8(0.6) 93.4(0.5)
Event Count Image 81.1(0.4) 93.5(0.6)
Voxel Grid 83.9(0.3) 92.9(0.6)
MDOE (invention) 85.8(0.8) 94.2(0.4)
E2VID 86.6 91.0
MDOE (invention) 88.2(0.4) 94.6(0.3)
Second, the results of the method and the benchmark methods are compared on the EvTouch-Objects dataset. Since the EvTouch-Objects data are collected using irregularly placed tactile sensors, they cannot be processed directly by convolutional neural networks; to make them usable, the tactile data are organized into an 11×11 grid structure. The benchmark methods include TactileSGNet, Event Frame, Event Count Image, Voxel Grid, and EST. All frame-like methods and the proposed MDOE use ResNet-34 as the classifier with the same configuration (e.g., training time and learning rate). As can be seen from Table 2, the proposed method is significantly better than all benchmark methods.
Table 2. Comparison of the results of the present invention and the benchmark methods on the EvTouch-Objects dataset
Method Average accuracy (variance)
TactileSGNet 89.4(0.6)
Event Frame 62.9(1.1)
Event Count Image 62.9(1.6)
Voxel Grid 94.3(1.5)
EST 93.1(1.6)
MDOE (invention) 95.8(1.1)
Third, the results of the method and the benchmark methods are compared in terms of time delay, i.e., the period of time spent accumulating evidence when deciding on an object class, which is critical for applications that require fast reaction times. The dataset evaluated is N-Caltech101, and the benchmark methods are Event Frame, Event Count Image, and Voxel Grid. As can be seen from fig. 3, the present invention outperforms all benchmark methods, and an accuracy of about 82% can be achieved using only the first 30 milliseconds of a sample. This makes the invention well suited to applications with high real-time requirements, such as autonomous navigation.
Finally, the results of the method and the benchmark methods are compared when training data of different proportions are used. The dataset is N-Caltech101, and the benchmark methods include Event Frame, Event Count Image, and Voxel Grid. For fairness, the present invention and all benchmark methods use the same configuration (e.g., ResNet-34 network, random seed, and learning rate). As can be seen from fig. 4, the classification accuracy of all methods increases as the proportion of training data used grows, and the proposed method is always more accurate than the benchmarks. Notably, the proposed method achieves an accuracy of about 63% when only 10% of the training data is used, much higher than Voxel Grid (about 59%), Event Count Image (55%), and Event Frame (53%). This means that the proposed method handles small datasets better, a feature that is very advantageous when it is difficult to collect large amounts of data.
Although the comparative implementation is based on object classification datasets, the proposed method may also be used for other event-based applications such as visual odometry, image reconstruction and optical flow estimation.
In this embodiment, an Adam optimizer is used to train the model by minimizing the cross-entropy loss; the initial learning rate is set to 0.0001 and is decayed by a factor of 0.5 every 10 iterations during training, with the total number of iterations set to 200. The batch size for all datasets is set to 4. For a robust evaluation, the model is run multiple times on each dataset with different random seeds, and the mean and standard deviation are reported.
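A hedged PyTorch sketch of this training configuration follows. "Every 10 iterations" is read here as every 10 epochs, the dummy dataset stands in for real MDOE tensors, and the adapted first convolution layer is an assumption about how the 4×C-channel input is fed to ResNet-34; none of these details are specified beyond the paragraph above.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision.models import resnet34

C, H, W = 9, 180, 240                    # 9 time bins, toy spatial resolution
model = resnet34(num_classes=101)        # N-Caltech101 has 101 object classes
# adapt the stem to the 4*C-channel MDOE input instead of 3-channel RGB
model.conv1 = nn.Conv2d(4 * C, 64, kernel_size=7, stride=2, padding=3, bias=False)

criterion = nn.CrossEntropyLoss()                           # cross-entropy loss
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # initial lr 0.0001
# decay by a factor of 0.5 every 10 (here: epochs, an assumption)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.5)

# stand-in dataset; in practice each sample is an MDOE tensor built as above
data = TensorDataset(torch.randn(8, 4 * C, H, W), torch.randint(0, 101, (8,)))
loader = DataLoader(data, batch_size=4, shuffle=True)       # batch size 4

for epoch in range(200):                 # total number of iterations set to 200
    for mdoe, label in loader:
        optimizer.zero_grad()
        loss = criterion(model(mdoe), label)
        loss.backward()
        optimizer.step()
    scheduler.step()
```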
Finally, an ablation study is performed on the proposed method. First, the contribution of each component of the proposed MDOE is evaluated by comparing the object recognition accuracy obtained using MDOE, MOE, and DOE, respectively. ResNet-34 is used as the classifier, and all methods use the same configuration (e.g., training time and learning rate) and 9 time bins. As can be seen from Table 3, MDOE performs better than MOE and DOE.
Table 3. Comparison of the test accuracy of the MOE, DOE, and MDOE methods
Representation method N-Caltech101 (variance) N-Cars (variance)
MOE 84.9(0.9) 93.5(0.5)
DOE 85.6(0.4) 92.9(0.5)
MDOE 85.8(0.8) 94.2(0.4)
Specifically, MOE is more accurate than DOE on N-Cars but less accurate than DOE on N-Caltech101. The proposed combination of MOE and DOE outperforms using MOE or DOE alone on both datasets. The effect of the number of time bins on the performance of the method is then analyzed, with the number of time bins varied from 1 to 64; Table 4 shows that the number of time bins affects the proposed method.
Table 4. Comparison of test accuracy with different numbers of time bins
Time bins N-Caltech101 (variance) N-Cars (variance)
1 83.4(0.4) 93.6(0.3)
9 85.8(0.8) 94.2(0.4)
16 86.2(0.7) 94.1(0.4)
32 86.6(0.8) 94.2(0.4)
64 86.4(0.6) 94.3(0.4)
When the number of time bins rises from 1 to 9, the accuracy achieved on N-Caltech101 increases from 83.4% to 85.8% and that on N-Cars from 93.6% to 94.2%. The test accuracy on N-Caltech101 continues to increase until the number of time bins reaches 32. The situation differs for the N-Cars dataset, where increasing the number of time bins beyond 9 has a negligible effect on the proposed method. Next, with the number of time bins set to 9, the effect of the network architecture is analyzed by evaluating the proposed method on four state-of-the-art deep learning models: ResNet-34, VGG-19, MobileNet-V2, and Inception-V3. Table 5 shows the test accuracy of the proposed method with the different network structures on the N-Caltech101 and N-Cars datasets. It can be seen that MobileNet-V2 and ResNet-34 achieve higher accuracy than VGG-19 and Inception-V3 on both datasets.
Table 5. Test accuracy of different models using MDOE
Model N-Caltech101 (variance) N-Cars (variance)
VGG-19 81.9(0.8) 91.9(0.6)
Inception-V3 84.9(0.6) 93.6(0.8)
MobileNet-V2 85.3(0.6) 94.8(0.8)
ResNet-34 85.8(0.8) 94.2(0.4)
Specifically, ResNet-34 performs best on N-Caltech101, followed by MobileNet-V2 and Inception-V3. In contrast, MobileNet-V2 achieves the highest accuracy on N-Cars, followed by ResNet-34 and Inception-V3. VGG-19 achieves the lowest accuracy on both datasets.
Finally, it is noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified without departing from the spirit and scope of the technical solution, and all such modifications are included in the scope of the claims of the present invention.

Claims (2)

1. A data representation method based on asynchronous event streams, characterized in that the method comprises the following steps: S1, inputting an asynchronous event stream as event data; S2, acquiring the magnitude of events (MOE); S3, acquiring the density of events (DOE); S4, concatenating the MOE and the DOE to obtain the MDOE representation, which simultaneously contains the magnitude information and the density information of the event data;
in step S2, the MOE is acquired by sampling: a fixed time interval is set and events are sampled at each interval, yielding a representation with dimensions 2×C×H×W, where H and W are the spatial height and width of the event image data, respectively, and C is the number of time bins;
in step S3, the event count over the whole event period is not used directly when acquiring the event density DOE; instead, the event count within each time window is used to generate a vector of size C for each pixel, and doing this for each polarity yields a DOE representation of size 2×C×H×W;
in step S1, let ε be the asynchronous event input sequence, expressed as:

ε = {e_i}_{i=1}^{N}, e_i = (x_i, t_i, p_i) (1)

where x_i is the position, t_i is the timestamp at which the event was generated, and p_i is the polarity of the event, which takes two values, 1 and −1, representing positive and negative events respectively; i is the event index; the dynamics of a spiking neuron are described as follows:

τ · du(t)/dt = −u(t) + I(t) (2)

where u(t) represents the membrane potential of the neuron at time t, I(t) = Σ_j w_j x_j(t) is a weighted sum of the presynaptic inputs, and τ is a time constant; over a period of time, the denser the events, the larger the membrane potential;
in steps S2 and S3, the MOE and the DOE are obtained by sampling the asynchronous event sequence, respectively; let E denote the MOE and V denote the DOE, described as follows:

E±(x_l, y_m, c_n) = Σ_{e_i∈ε±} 1_{e_i}(t_1, t_n) · k(x_l − x_i) · k(y_m − y_i) (3)

V±(x_l, y_m, c_n) = Σ_{e_i∈ε±} 1_{e_i}(t_n − ΔT, t_n) · k(x_l − x_i) · k(y_m − y_i) (4)

k(a) = max(0, 1 − |a|) (5)

1_{e_i}(t_a, t_b) = 1 if t_a ≤ t_i < t_b, and 0 otherwise (6)

t_n = t_1 + (c_n + 1)ΔT (8)

where ε+ and ε− are the sub-sequences of events with positive and negative polarity, respectively; k(a) is a bilinear kernel that mimics neuronal dynamics; 1_{e_i} is an indicator function ensuring that only events within the specified time period are counted; (x_l, y_m, c_n) are the space-time coordinates on the voxel grid, with x_l ∈ {0, 1, …, W−1}, y_m ∈ {0, 1, …, H−1}, and c_n ∈ {0, 1, …, C−1}; ΔT is the size of a time bin; to reduce the effect of sensor noise, the resulting MOE and DOE are normalized by dividing them by their respective maximum values over the pixels and time bins;
in step S4, the MDOE, denoted S, is the concatenation of the MOE and the DOE along the polarity dimension, i.e.

S(p, x_l, y_m, c_n) = Cat(E(x_l, y_m, c_n), V(x_l, y_m, c_n)) (9)

where Cat denotes the concatenation operation and p indexes the polarity channels of the MOE and DOE, p ∈ {0, 1, 2, 3}: p = 0 and p = 1 denote the ON and OFF polarities of the MOE, and p = 2 and p = 3 correspond to the ON and OFF polarities of the DOE.
2. A data representation system based on asynchronous event streams, characterized in that the system employs the method of claim 1 to represent data from asynchronous event streams.
CN202210379153.6A 2022-04-12 2022-04-12 Data representation method and system based on asynchronous event stream Active CN114723009B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210379153.6A CN114723009B (en) 2022-04-12 2022-04-12 Data representation method and system based on asynchronous event stream

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210379153.6A CN114723009B (en) 2022-04-12 2022-04-12 Data representation method and system based on asynchronous event stream

Publications (2)

Publication Number Publication Date
CN114723009A (en) 2022-07-08
CN114723009B (en) 2023-04-25

Family

ID=82243957

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210379153.6A Active CN114723009B (en) 2022-04-12 2022-04-12 Data representation method and system based on asynchronous event stream

Country Status (1)

Country Link
CN (1) CN114723009B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108182670A (en) * 2018-01-15 2018-06-19 清华大学 A kind of resolution enhancement methods and system of event image
EP3396550A1 (en) * 2017-04-28 2018-10-31 Ebury Technology Limited Asynchronous event-based instruction processing system and method
CN110991602A (en) * 2019-09-08 2020-04-10 天津大学 Event-driven pulse neuron simulation algorithm based on single exponential kernel
CN111695681A (en) * 2020-06-16 2020-09-22 清华大学 High-resolution dynamic visual observation method and device
CN112597980A (en) * 2021-03-04 2021-04-02 之江实验室 Brain-like gesture sequence recognition method for dynamic vision sensor
CN112712170A (en) * 2021-01-08 2021-04-27 西安交通大学 Neural morphology vision target classification system based on input weighted impulse neural network
CN113177640A (en) * 2021-05-31 2021-07-27 重庆大学 Discrete asynchronous event data enhancement method
CN113269699A (en) * 2021-04-22 2021-08-17 天津(滨海)人工智能军民融合创新中心 Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image
CN114115152A (en) * 2021-11-25 2022-03-01 武汉智能装备工业技术研究院有限公司 Manufacturing edge real-time event insight method based on embedded type and deep learning

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3396550A1 (en) * 2017-04-28 2018-10-31 Ebury Technology Limited Asynchronous event-based instruction processing system and method
CN108182670A (en) * 2018-01-15 2018-06-19 清华大学 A kind of resolution enhancement methods and system of event image
CN110991602A (en) * 2019-09-08 2020-04-10 天津大学 Event-driven pulse neuron simulation algorithm based on single exponential kernel
CN111695681A (en) * 2020-06-16 2020-09-22 清华大学 High-resolution dynamic visual observation method and device
CN112712170A (en) * 2021-01-08 2021-04-27 西安交通大学 Neural morphology vision target classification system based on input weighted impulse neural network
CN112597980A (en) * 2021-03-04 2021-04-02 之江实验室 Brain-like gesture sequence recognition method for dynamic vision sensor
CN113269699A (en) * 2021-04-22 2021-08-17 天津(滨海)人工智能军民融合创新中心 Optical flow estimation method and system based on fusion of asynchronous event flow and gray level image
CN113177640A (en) * 2021-05-31 2021-07-27 重庆大学 Discrete asynchronous event data enhancement method
CN114115152A (en) * 2021-11-25 2022-03-01 武汉智能装备工业技术研究院有限公司 Manufacturing edge real-time event insight method based on embedded type and deep learning

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
刘彤; 倪维健; 孙宇健; 曾庆田. Method for predicting the remaining execution time of business process instances based on deep transfer learning. Data Analysis and Knowledge Discovery, 2020, (Z1). *
王晓浪; 邓蔚; 胡峰; 邓维斌; 张清华. Joint event extraction method based on sequence labeling. Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition), 2020, (05). *
闻佳; 王宏君; 邓佳; 刘鹏飞. Abnormal event detection based on deep learning. Acta Electronica Sinica, 2020, (02). *
陈浩; 李永强; 冯远静. Dynamic knowledge graph reasoning based on multi-relational recurrent events. Pattern Recognition and Artificial Intelligence, 2020, (04). *
齐华青. Abnormal event detection method based on deep learning and sparse combination. Electronic Measurement Technology, 2019, (20). *

Also Published As

Publication number Publication date
CN114723009A (en) 2022-07-08

Similar Documents

Publication Publication Date Title
Chung et al. An efficient hand gesture recognition system based on deep CNN
Hasan et al. Learning temporal regularity in video sequences
CN108596327B (en) Seismic velocity spectrum artificial intelligence picking method based on deep learning
CN112597980B (en) Brain-like gesture sequence recognition method for dynamic vision sensor
Dewan et al. Deeptemporalseg: Temporally consistent semantic segmentation of 3d lidar scans
CN110532959B (en) Real-time violent behavior detection system based on two-channel three-dimensional convolutional neural network
CN115601403A (en) Event camera optical flow estimation method and device based on self-attention mechanism
CN110610210A (en) Multi-target detection method
CN115146842B (en) Multi-element time sequence trend prediction method and system based on deep learning
CN111985333A (en) Behavior detection method based on graph structure information interaction enhancement and electronic device
Seidel et al. NAPC: A neural algorithm for automated passenger counting in public transport on a privacy-friendly dataset
Liu et al. Design of face detection and tracking system
CN114723009B (en) Data representation method and system based on asynchronous event stream
Xu et al. Tackling small data challenges in visual fire detection: a deep convolutional generative adversarial network approach
CN110633741B (en) Time sequence classification method based on improved impulse neural network
CN113076808A (en) Method for accurately acquiring bidirectional pedestrian flow through image algorithm
CN116843662A (en) Non-contact fault diagnosis method based on dynamic vision and brain-like calculation
Mahabal et al. Real-time classification of transient events in synoptic sky surveys
CN116229323A (en) Human body behavior recognition method based on improved depth residual error network
CN115964258A (en) Internet of things network card abnormal behavior grading monitoring method and system based on multi-time sequence analysis
Li et al. Research on YOLOv3 pedestrian detection algorithm based on channel attention mechanism
CN114035687A (en) Gesture recognition method and system based on virtual reality
Gu et al. MDOE: A spatiotemporal event representation considering the magnitude and density of events
Kong et al. One-Dimensional Convolutional Neural Networks Based on Exponential Linear Units for Bearing Fault Diagnosis
CN117079416B (en) Multi-person 5D radar falling detection method and system based on artificial intelligence algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant