CN112712170A

CN112712170A - Neuromorphic vision target classification system based on an input-weighted spiking neural network

Info

Publication number: CN112712170A
Application number: CN202110025992.3A
Authority: CN
Inventors: 赵广社; 姚满; 王鼎衡; 刘美兰
Original assignee: Xian Jiaotong University
Current assignee: Xian Jiaotong University
Priority date: 2021-01-08
Filing date: 2021-01-08
Publication date: 2021-04-27
Anticipated expiration: 2041-01-08
Also published as: CN112712170B

Abstract

The invention discloses a neuromorphic vision target classification system based on an input-weighted spiking neural network. The system belongs to the technical field of artificial neural networks and comprises four modules: a data preprocessing module, a network construction module, a learning module and an inference module. The data preprocessing module converts the spatio-temporal pulse event stream output asynchronously by an event camera into an event frame sequence. The network construction module assembles the input weighting unit and the spiking neural layer unit into an input-weighted spiking neural network according to a given network connection scheme. The learning module trains the input-weighted spiking neural network obtained by the network construction module on the event frame sequence obtained by the preprocessing module and generates a model file. The inference module loads the network model file output by the learning module and performs the feedforward network computation. The invention enables a spiking neural network for neuromorphic vision classification to achieve low latency while maintaining high performance.

Description

Neuromorphic vision target classification system based on an input-weighted spiking neural network

Technical Field

The invention belongs to the field of deep learning within machine learning, and particularly relates to a neuromorphic vision target classification system based on an input-weighted spiking neural network.

Background

The event camera is an asynchronous neuromorphic vision sensor that brings a paradigm shift in the way visual information is acquired. Unlike a traditional vision sensor, which samples light at fixed time intervals, the event camera samples light according to scene dynamics: it asynchronously measures the brightness change at each pixel and generates a pulse event stream that encodes the time, position and polarity of each brightness change. Spiking neural networks (SNNs) are a new generation of artificial neural networks inspired by the operating mechanism of the brain, with pulse sequences as the form of data transmission; compared with traditional artificial neural networks (ANNs), they offer advantages such as ultra-low latency and low energy consumption. The high-temporal-resolution (microsecond-level) event stream output of the event camera, combined with an ultra-low-latency spiking neural network, has great potential in computer vision application scenarios that place demands on efficiency and power consumption.

In theory, the pulse event stream output by the neuromorphic vision sensor can be processed event by event: each incoming event changes the internal state of the spiking neural network, so the output is obtained with the minimum latency. However, because a single event carries very little information, the output at the lowest latency tends to be poor. Another approach is to aggregate the pulse event stream over a period of time into an event frame. An event frame contains far more information than a single event, which lets the network achieve better performance but introduces some delay. New neuromorphic vision algorithms therefore need to be researched and developed to strike a balance between low latency and high performance.

Object recognition and classification is an important task in neuromorphic vision. Supervised learning methods for neuromorphic vision target classification based on pulse-rate coding fall mainly into two categories:

(1) Conversion-based methods: a trained deep neural network (such as a convolutional neural network or a fully connected neural network) is converted into a spiking neural network by weight adjustment and normalization, and the resulting network is applied to neuromorphic vision target classification. This approach can achieve accuracy comparable to that of deep neural networks, but it has inherent limitations, such as restrictions on the activation functions that can be used and no use of the temporal information in the event stream.

(2) Pulse-based methods: synaptic weights are learned by back-propagating the training error through time and space; a representative algorithm is backpropagation through time (BPTT). This approach has higher computational efficiency, but its accuracy in neuromorphic vision target classification does not exceed that of conversion-based methods.

The main difficulty in neuromorphic vision target classification and recognition lies in making full use of the spatio-temporal information in the pulse event stream to extract spatio-temporal features effectively, so as to obtain higher classification accuracy. In the prior art, a spiking neural network based on the LIF (leaky integrate-and-fire) neuron model updates its weights with a spatio-temporal error back-propagation learning algorithm, so researchers can accelerate training on a GPU (graphics processing unit) and use mature deep learning training tools such as PyTorch. However, this method does not differentiate between the inputs to the network, which may limit network performance to some extent. In fact, inputs at different times have different signal-to-noise ratios, and giving the same input weight to all input times weakens the network's ability to extract valid spatio-temporal features.

Disclosure of Invention

To overcome the defects of the prior art, the invention provides a neuromorphic vision target classification system based on an input-weighted spiking neural network. Pulse event stream data are first aggregated into event frame sequence data; an input weight is then assigned to each event frame according to the amount of information it contains; finally, the weighted event frames are used as the new input of the spiking neural network. The spiking neural network can thus adaptively focus on the event frames in the sequence that have a large influence on the classification result, which improves the accuracy of neuromorphic vision target classification and recognition while requiring only a small amount of data.

In order to achieve the purpose, the invention provides the following technical scheme:

the neural morphology visual target classification system based on the input weighted pulse neural network comprises a data preprocessing module, a network construction module, a learning module and an inference module which are connected in sequence;

the data preprocessing module is used for acquiring the spatio-temporal pulse event stream asynchronously output by an event camera, the spatio-temporal pulse events in the stream being described using the address-event representation (AER) protocol; for aggregating the spatio-temporal pulse event stream into an event frame sequence according to the time resolution dt' of the event camera, the event frame sequence being described by tensors; and for aggregating the event frame sequence with time resolution dt' into a new event frame sequence according to the set time resolution dt, the event frame data being described by tensors; the event frame sequence tensor data serve as the output of the data preprocessing module;

the network construction module consists of an input weighting unit and a pulse neural layer unit and is used for constructing an input weighting pulse neural network;

the learning module is used for learning the input weighted impulse neural network obtained by the network construction module according to the event frame sequence obtained by the preprocessing module and generating a model file;

and the inference module is used for reading the input-weighted spiking neural network structure configured by the network construction module, loading the model file generated by the learning module to obtain the network parameters and thus a trained input-weighted spiking neural network model, and taking a plurality of event frames output by the data preprocessing module as the input of the trained model to obtain an inference result.

A further improvement of the invention is that the data preprocessing module aggregates the spatio-temporal pulse event stream output by the event camera into an event frame sequence according to the set temporal resolution $dt$, specifically as follows:

the spatio-temporal pulse event stream is given by the set $E = \{e_i \mid e_i = [x_i, y_i, t'_i, p_i]\}$, where $e_i$ is the $i$-th pulse event in the stream, $(x_i, y_i)$ is the pixel coordinate of the $i$-th pulse event, $t'_i$ is its timestamp in the whole event stream, and $p_i$ is its light intensity change polarity; the temporal resolution of the asynchronous pulse event stream of the event camera is $dt'$ and its spatial resolution is $H \times W$; new spatio-temporal event frame data are then aggregated according to the set temporal resolution $dt$.

A further improvement of the invention is that the data aggregation is carried out in two steps, specifically:

first, based on the temporal resolution $dt'$ of the event camera, the events $E_{t'}$ generated at time $t'$ are assembled into a tensor $X_{t'}$, where $E_{t'} = \{e_i \mid e_i = [x_i, y_i, t', p_i]\}$ and $X_{t'} \in \mathbb{R}^{H \times W \times 2}$;

second, based on the set temporal resolution $dt$, the event frame tensor $X_t \in \mathbb{R}^{H \times W \times 2}$ at time $t$ is generated using the formula $X_t = f(X'_t)$, where $dt = \beta \times dt'$, $\beta$ is the aggregation time factor, and $X'_t = \{X_{t'} \mid t' \in [\beta \times t, \beta \times (t+1) - 1]\}$; $f$ is an accumulation operation, a weighted accumulation operation, or a logical AND, OR or NOT operation.
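The two-step aggregation described above can be sketched in Python with NumPy. This is a minimal illustration rather than the patented implementation: the function name `events_to_frames`, the `(N, 4)` integer event array layout, the encoding of polarity as a channel index 0 or 1, and the fusing of both steps into one pass are all assumptions made for the example. Only the accumulation and OR variants of $f$ are shown.

```python
import numpy as np

def events_to_frames(events, height, width, beta, num_frames, mode="accumulate"):
    """Aggregate AER events [x, y, t', p] into event frames of shape (T, H, W, 2).

    Assumptions (not from the source): events is an integer array of shape
    (N, 4), timestamps t' are in units of dt', and polarity p is 0 or 1.
    """
    frames = np.zeros((num_frames, height, width, 2), dtype=np.int64)
    x, y, t_fine, p = events.T
    # Frame index t satisfies beta * t <= t' <= beta * (t + 1) - 1.
    t = t_fine // beta
    keep = (t >= 0) & (t < num_frames)
    # np.add.at accumulates correctly even when the same index repeats.
    np.add.at(frames, (t[keep], y[keep], x[keep], p[keep]), 1)
    if mode == "or":
        # Logical OR over the window: 1 if any event occurred, else 0.
        frames = (frames > 0).astype(np.int64)
    return frames

events = np.array([
    [0, 0, 0, 1],   # x=0, y=0, t'=0, p=1
    [0, 0, 2, 1],   # same pixel and polarity, still in frame 0 when beta=3
    [1, 0, 3, 0],   # falls into frame 1
])
acc = events_to_frames(events, height=2, width=2, beta=3, num_frames=2)
orr = events_to_frames(events, height=2, width=2, beta=3, num_frames=2, mode="or")
```

With accumulation, the two events at pixel (0, 0) give a count of 2 in frame 0; with OR, the same pixel holds 1. The frames are stored time-first here, whereas the text writes the layout as $H \times W \times 2$ per frame.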

A further improvement of the invention is that the network construction module consists of one or more input weighting units, spiking neural layer units, and one or more perceptron neuron output layers; the spiking neurons adopt the LIF neuron model; the input weighting unit operates in three steps, specifically:

first, the event frames $X = \{X_1, X_2, \ldots, X_t, \ldots, X_T\}$, $X \in \mathbb{R}^{H \times W \times 2 \times T}$, obtained by the data preprocessing module are taken as input and compressed into a vector $z = \{z_1, z_2, \ldots, z_t, \ldots, z_T\}$, $z \in \mathbb{R}^{T}$, using a compression function $f$; when the compression function $f$ is an average pooling function, the $t$-th tensor $X_t$ in $X$ is compressed into a value $z_t$ by:

$$z_t = \frac{1}{H \times W \times 2} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{2} X_t(i, j, k)$$

second, the vector $z$ is fed into a two-layer nonlinear fully connected network to obtain an output vector $s \in \mathbb{R}^{T}$:

$$s = \sigma(W_2\,\delta(W_1 z))$$

where $\delta$ is the ReLU activation function, $\sigma$ is the Sigmoid function, the weight matrices $W_1$ and $W_2$ are trainable parameters, and $r$ is an optional hyper-parameter;

third, the value $s_t$ in the vector $s$ is used as the weight of event frame $X_t$, and each element of the event frame is multiplied by it to obtain a new event frame $\tilde{X}_t$ as the output of the input weighting unit:

$$\tilde{X}_t = s_t X_t$$
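The three steps of the input weighting unit can be sketched with NumPy as follows. This is a hedged illustration only: the function name `input_weighting` is hypothetical, the frames are stored time-first as `(T, H, W, 2)`, and the weight shapes $W_1 \in \mathbb{R}^{(T/r) \times T}$ and $W_2 \in \mathbb{R}^{T \times (T/r)}$ are an assumption that treats $r$ as a reduction ratio, since the text does not state the matrix dimensions. The weights are random here; in the system they would be learned.

```python
import numpy as np

def input_weighting(frames, w1, w2):
    """Weight each event frame by a score s_t in (0, 1).

    frames: array of shape (T, H, W, 2); w1: (T // r, T); w2: (T, T // r).
    The shapes of w1 and w2 are an assumption of this sketch.
    """
    # Step 1: average pooling compresses every frame X_t into one scalar z_t.
    z = frames.reshape(frames.shape[0], -1).mean(axis=1)
    # Step 2: s = sigmoid(W2 @ relu(W1 @ z)).
    hidden = np.maximum(w1 @ z, 0.0)
    s = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))
    # Step 3: scale every element of X_t by its weight s_t.
    weighted = frames * s[:, None, None, None]
    return weighted, s

rng = np.random.default_rng(0)
T, H, W, r = 4, 3, 3, 2
frames = rng.integers(0, 2, size=(T, H, W, 2)).astype(np.float64)
w1 = rng.normal(size=(T // r, T))
w2 = rng.normal(size=(T, T // r))
weighted, s = input_weighting(frames, w1, w2)
```

Because the Sigmoid output lies strictly between 0 and 1, each frame is attenuated rather than amplified, and the relative sizes of the $s_t$ values determine which frames dominate the input to the spiking layers.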

the further improvement of the invention is that the learning module comprises a feedforward network computing unit, an error back propagation unit and a weight updating unit, and specifically comprises:

the feedforward network computing unit uses a random time clipping method to select $T$ event frames from the event frame sequence converted from a single event stream and takes them as its input, the $T$ event frames being described by a tensor; specifically, from the $T_{total}$ event frames generated at temporal resolution $dt$ from a single spatio-temporal pulse event stream, $T$ event frames ($T < T_{total}$) are randomly extracted as the input of the feedforward network computing unit;

an output pulse sequence is then computed by passing the input through the input weighting unit and the spiking neural layer unit in order, following the network connection scheme, and the perceptron neuron output layer computes the output target from the output pulse sequence;

the error back propagation unit calculates the error between the output target and the set target according to the set loss function and performs back propagation;

the weight updating unit updates the weight according to the set learning rate and the set error.

The invention has the further improvement that the reasoning module consists of a model loading unit and a feedforward network computing unit; the model loading unit reads an input weighted impulse neural network structure configured by the network construction module and loads a model file generated by the learning module to obtain a trained input weighted impulse neural network model; the feedforward network computing unit is used for computing input weighting and a pulse neuron layer in the impulse neural network model according to the event stream or the event frame provided by the data preprocessing module, and then the perceptron neuron output layer obtains an output target according to output pulses of the pulse neuron layer.

Compared with the prior art, the invention has at least the following beneficial technical effects:

To address the high latency and low classification accuracy of the spiking neural networks used in existing neuromorphic vision target classification methods, the invention improves the existing spiking neural network structure by introducing an input weighting unit that adaptively assigns higher weights to the event frames with a large influence on the classification result. This improves the ability to extract spatio-temporal features from the pulse event stream and effectively raises the classification accuracy of the spiking neural network while reducing the amount of data required.

Drawings

FIG. 1 is a diagram of a neuromorphic visual target classification system based on an input weighted impulse neural network according to the present invention.

Fig. 2 is the output of one event stream sample (a waving motion) from the neuromorphic vision data set DVS128 Gesture after it passes through the data preprocessing module. That is, a 200,000-microsecond pulse event stream is aggregated into 10 event frames with a resolution of 20 milliseconds.

FIG. 3 is a network connection of an input weighted impulse neural network.

FIG. 4 is a schematic diagram of the input weighting unit.

FIG. 5 is a schematic diagram of the spiking neural layer unit.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in fig. 1, the neural morphology visual target classification system based on the weighted impulse neural network provided by the invention is composed of a data preprocessing module S101, a network construction module S102, a learning module S103 and an inference module S104 which are connected in sequence.

The data preprocessing module S101 is composed of an impulse event stream aggregation unit and an event frame aggregation unit, and is configured to aggregate spatiotemporal impulse event streams asynchronously output by the event camera into an event frame sequence.

The input of the data preprocessing module is a space-time pulse event stream output by an event camera, and the space-time pulse event is described by adopting an address event expression protocol; the pulse event stream aggregation unit is used for aggregating the pulse event streams into an event frame sequence according to the time resolution dt' of the event camera; the event frame aggregation unit is used for aggregating an event frame sequence with the resolution dt' into new event frame sequence data according to the set time resolution dt, wherein the event frame data are described by tensors; and the event frame sequence output by the event frame aggregation unit is used as the output of the data preprocessing module.

The spatio-temporal pulse event stream is given by the set $E = \{e_i \mid e_i = [x_i, y_i, t'_i, p_i]\}$, where $e_i$ is the $i$-th pulse event in the stream, $(x_i, y_i)$ is the pixel coordinate of the $i$-th pulse event, $t'_i$ is its timestamp in the whole event stream, and $p_i$ is its light intensity change polarity. The temporal resolution of the asynchronous pulse event stream of the event camera is $dt'$ and its spatial resolution is $H \times W$. Typically, the temporal resolution of an event camera is on the order of microseconds, i.e., $dt'$ = 1 μs.

The pulse event stream aggregation unit is configured to aggregate the pulse event stream into an event frame sequence according to the temporal resolution $dt'$ of the event camera. Specifically, based on the temporal resolution $dt'$ of the event camera, the events $E_{t'}$ generated at time $t'$ are assembled into a tensor $X_{t'}$, where $E_{t'} = \{e_i \mid e_i = [x_i, y_i, t', p_i]\}$ and $X_{t'} \in \mathbb{R}^{H \times W \times 2}$.

The event frame aggregation unit is configured to aggregate the event frame sequence with resolution $dt'$ = 1 μs into new event frame sequence data according to the set temporal resolution $dt$. Specifically, based on the set temporal resolution $dt$, the event frame tensor $X_t \in \mathbb{R}^{H \times W \times 2}$ at time $t$ is generated using the formula $X_t = f(X'_t)$, where $dt = \beta \times dt'$, $\beta$ is the aggregation time factor, and $X'_t = \{X_{t'} \mid t' \in [\beta \times t, \beta \times (t+1) - 1]\}$; $f$ may be an accumulation operation, a weighted accumulation operation, a logical AND, OR or NOT operation, or the like.

Fig. 2 shows the output of the data preprocessing module when the input is a 200,000-microsecond pulse event stream from one sample (a waving motion) of the data set DVS128 Gesture, the aggregation method is the OR operation, and the aggregation time factor is 20,000. That is, fig. 2 shows 10 event frames with a resolution of 20 ms aggregated from the event stream.
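As a check on the numbers in this example, and assuming the camera resolution dt' = 1 μs given earlier in the description, the frame resolution and the number of frames work out as:

```latex
dt = \beta \times dt' = 20{,}000 \times 1\,\mu\mathrm{s} = 20\,\mathrm{ms},
\qquad
T_{total} = \frac{200{,}000\,\mu\mathrm{s}}{20{,}000\,\mu\mathrm{s}} = 10
```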

When dt' = 1 μs and β = 3, an example of the OR aggregation method is as follows: (example not reproduced in this text)

When dt' = 1 μs and β = 3, an example of the accumulation operation is as follows: (example not reproduced in this text)
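To make the difference between the two operations concrete for β = 3, the following NumPy sketch aggregates three fine-resolution frames into one. The pixel values are made up for illustration and are not the patent's own example; a single polarity channel of a 2 × 2 sensor is shown for brevity.

```python
import numpy as np

# Three fine-resolution frames X_{t'} for t' = 0, 1, 2 (beta = 3), shown for
# one polarity channel of a 2 x 2 sensor. Values are illustrative only.
fine_frames = np.array([
    [[1, 0],
     [0, 0]],
    [[1, 0],
     [0, 1]],
    [[0, 0],
     [0, 1]],
])

# OR aggregation: a pixel is 1 if it fired in any of the beta fine frames.
or_frame = np.logical_or.reduce(fine_frames.astype(bool), axis=0).astype(int)

# Accumulation: a pixel holds the number of events over the beta fine frames.
sum_frame = fine_frames.sum(axis=0)

# or_frame  holds the values [[1, 0], [0, 1]]
# sum_frame holds the values [[2, 0], [0, 2]]
```

OR keeps the frame binary, so it records only whether a pixel was active in the window, while accumulation preserves how often it fired.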

The network construction module S102 is used to configure the input-weighted spiking neural network. It consists of one or more input weighting units, spiking neural layer units (a spiking neural layer is formed by fully connected or convolutional spiking neurons), and one or more perceptron neuron output layers; the spiking neurons adopt the LIF neuron model. Fig. 3 shows a network connection formed by the input weighting unit and the spiking neural layer unit.

The input weighting unit is shown in fig. 4. The unit is composed of three steps, and specifically comprises:

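first, the event frames $X = \{X_1, X_2, \ldots, X_t, \ldots, X_T\}$, $X \in \mathbb{R}^{H \times W \times 2 \times T}$, obtained by the data preprocessing module are taken as input and compressed into a vector $z = \{z_1, z_2, \ldots, z_t, \ldots, z_T\}$, $z \in \mathbb{R}^{T}$, using a compression function $f$. The compression function $f$ may be an average pooling function, a maximum pooling function, or the like. When $f$ is the average pooling function, the $t$-th tensor $X_t$ in $X$ is compressed into a value $z_t$ according to the following formula:

$$z_t = \frac{1}{H \times W \times 2} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{2} X_t(i, j, k)$$

second, the vector $z$ is fed into a two-layer nonlinear fully connected network to obtain an output vector $s \in \mathbb{R}^{T}$:

$$s = \sigma(W_2\,\delta(W_1 z))$$

where $\delta$ is the ReLU activation function, $\sigma$ is the Sigmoid function, $W_1$ and $W_2$ are trainable weight matrices, and $r$ is an optional parameter.

third, the value $s_t$ in the vector $s$ is used as the weight of event frame $X_t$, and each element of the event frame is multiplied by it to obtain a new event frame $\tilde{X}_t$ as the output of the input weighting unit:

$$\tilde{X}_t = s_t X_t$$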

The spiking neural layer unit is composed of LIF neuron models, as shown in fig. 5. In the LIF neuron layer expression, the function $g$ is a step function and $\odot$ denotes the Hadamard product; $x^{t,n-1}$ is the event frame input from layer $n-1$ at time $t$, $h^{t-1,n}$ is the internal state of layer $n$ at time $t-1$, and $u^{t,n}$ is the membrane potential. The membrane potential is compared with the neuron threshold $u_{th}$; $x^{t,n}$ is passed as the spatial output to the next layer, and $h^{t,n}$ is passed as the temporal output of the LIF neuron to the next time step. When the weight matrix and the spatial input $x^{t,n-1}$ are combined by matrix multiplication, the LIF neuron model is a fully connected LIF neuron model; when they are combined by a convolution operation, it is a convolutional LIF neuron model.
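One time step of a fully connected LIF layer can be sketched as below. The exact layer expression is not reproduced in this text, so the update rule here is an assumption: it follows a common discrete-time LIF formulation in which the membrane potential is the carried-over state plus the weighted input, a step function produces the spike, and the state is decayed and reset through a Hadamard product with one minus the spike. The function name `lif_step`, the decay factor 0.5 and the threshold 1.0 are hypothetical choices for the example.

```python
import numpy as np

def lif_step(x_prev_layer, h_prev_time, weight, u_th=1.0, decay=0.5):
    """One time step of a fully connected LIF layer (illustrative only).

    The update rule, the decay factor and the hard reset are assumptions of
    this sketch; they follow a common discrete-time LIF formulation.
    """
    # Membrane potential: carried-over state plus the weighted spatial input.
    u = h_prev_time + weight @ x_prev_layer
    # Step function g: emit a spike where the potential reaches the threshold.
    spikes = (u >= u_th).astype(u.dtype)
    # Temporal output: decayed potential, reset to zero where a spike occurred
    # (element-wise, i.e. a Hadamard product with 1 - spikes).
    h = decay * u * (1.0 - spikes)
    return spikes, h

weight = np.array([[0.6, 0.6],
                   [0.1, 0.1]])
h = np.zeros(2)
x = np.array([1.0, 1.0])
spikes_1, h = lif_step(x, h, weight)
spikes_2, h = lif_step(x, h, weight)
```

Replacing `weight @ x_prev_layer` with a convolution would give the convolutional variant mentioned in the text. In this run the first neuron crosses the threshold at both steps, while the second accumulates a sub-threshold potential that is carried forward in `h`.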

The learning module S103 is composed of a feedforward network computing unit, an error back propagation unit and a weight updating unit, and is used for learning the input weighted impulse neural network obtained by the network construction module according to the event frame sequence obtained by the preprocessing module and generating a model file.

The feedforward network computing unit uses a random time clipping method to select $T$ event frames from the event frame sequence converted from a single event stream and takes them as its input; the $T$ event frames are described by a tensor. Specifically, from the $T_{total}$ event frames generated at temporal resolution $dt$ from a single spatio-temporal pulse event stream, $T$ event frames ($T < T_{total}$) are randomly extracted as the input of the feedforward network computing unit. An output pulse sequence is then computed by passing the input through the input weighting unit and the spiking neural layer unit in order, following the network connection scheme, and the perceptron neuron output layer computes the output target from the output pulse sequence;

and the error back propagation unit calculates the error between the output target and the set target according to the set loss function and performs back propagation.

The weight updating unit updates the weight according to the set learning rate and error.
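The random time clipping used to pick the $T$ input frames during learning can be sketched as follows. The source says only that $T$ frames are randomly extracted from $T_{total}$; this sketch assumes a contiguous window starting at a random offset, and the function name `random_time_clip` is hypothetical.

```python
import numpy as np

def random_time_clip(frame_sequence, num_frames, rng):
    """Pick num_frames consecutive frames starting at a random offset.

    Assumes a contiguous window; the source only says T frames are
    randomly extracted from T_total, with T < T_total.
    """
    total = frame_sequence.shape[0]
    if num_frames >= total:
        raise ValueError("num_frames must be smaller than the sequence length")
    start = int(rng.integers(0, total - num_frames + 1))
    return frame_sequence[start:start + num_frames], start

rng = np.random.default_rng(42)
# Ten dummy frames of shape (2, 2, 2); frame i is filled with the value i.
sequence = np.arange(10).reshape(10, 1, 1, 1) * np.ones((10, 2, 2, 2))
clip, start = random_time_clip(sequence, num_frames=4, rng=rng)
```

Drawing a different window on every pass exposes the network to different temporal segments of the same recording, which is one plausible reason for clipping at random rather than always taking the first $T$ frames.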

The inference module S104 is composed of a model loading unit and a feedforward network computing unit. It reads the input-weighted spiking neural network structure configured by the network construction module, loads the model file generated by the learning module to obtain the network parameters and thus the trained input-weighted spiking neural network model, and takes the event frames output by the data preprocessing module as the input of that model to obtain an inference result.

And the model loading unit reads the impulse neural network structure configured by the network construction module and loads the model file generated by the learning module to obtain the trained input weighted impulse neural network model.

The feedforward network computing unit takes the event stream provided by the data preprocessing module as input, computes input weighting and a pulse neuron layer in the pulse neural network model, and then obtains an output target by the perceptron neuron output layer according to output pulses of the pulse neuron layer.

The data preprocessing module, the network construction module, the learning module and the reasoning module are connected in sequence.

To better illustrate the beneficial effects of the invention, an experiment with the method on the neuromorphic vision target classification data set DVS128 Gesture is described below. In the experiment, 60 event frames are used as input, and the influence of the input weighting unit on the final result is tested at different temporal resolutions dt:

experimental results of convolution-based input weighted pulse neural network on DVS128 Gesture

As can be seen from the above table, when performing neuromorphic visual target classification, compared to the conventional impulse neural network, the input weighted impulse neural network can achieve higher classification performance while requiring only a small amount of data. For example, when the time resolution is 1 millisecond, only 60 milliseconds of data are required to achieve 91.28% accuracy. The input weighted impulse neural network can improve the network performance at all time resolutions.

Claims

1. The neural morphology visual target classification system based on the input weighted pulse neural network is characterized by comprising a data preprocessing module, a network construction module, a learning module and an inference module which are connected in sequence;

2. The neuromorphic vision target classification system based on an input-weighted spiking neural network as claimed in claim 1, wherein the data preprocessing module aggregates the spatio-temporal pulse event stream output by the event camera into an event frame sequence according to a set temporal resolution $dt$, specifically comprising:

the spatio-temporal pulse event stream is given by the set $E = \{e_i \mid e_i = [x_i, y_i, t'_i, p_i]\}$, where $e_i$ is the $i$-th pulse event in the stream, $(x_i, y_i)$ is the pixel coordinate of the $i$-th pulse event, $t'_i$ is its timestamp in the whole event stream, and $p_i$ is its light intensity change polarity; the temporal resolution of the asynchronous pulse event stream of the event camera is $dt'$ and its spatial resolution is $H \times W$; new spatio-temporal event frame data are then aggregated according to the set temporal resolution $dt$.

3. The neuromorphic vision target classification system based on an input-weighted spiking neural network as claimed in claim 2, wherein the data aggregation is carried out in two steps, specifically comprising:

first, based on the temporal resolution $dt'$ of the event camera, the events $E_{t'}$ generated at time $t'$ are assembled into a tensor $X_{t'}$, where $E_{t'} = \{e_i \mid e_i = [x_i, y_i, t', p_i]\}$ and $X_{t'} \in \mathbb{R}^{H \times W \times 2}$;

second, based on the set temporal resolution $dt$, the event frame tensor $X_t \in \mathbb{R}^{H \times W \times 2}$ at time $t$ is generated using the formula $X_t = f(X'_t)$, where $dt = \beta \times dt'$, $\beta$ is the aggregation time factor, and $X'_t = \{X_{t'} \mid t' \in [\beta \times t, \beta \times (t+1) - 1]\}$; $f$ is an accumulation operation, a weighted accumulation operation, or a logical AND, OR or NOT operation.

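4. The neuromorphic vision target classification system based on an input-weighted spiking neural network as claimed in claim 1, wherein the network construction module consists of one or more input weighting units, spiking neural layer units, and one or more perceptron neuron output layers; the spiking neurons adopt the LIF neuron model; the input weighting unit operates in three steps, specifically comprising:

first, the event frames $X = \{X_1, X_2, \ldots, X_t, \ldots, X_T\}$, $X \in \mathbb{R}^{H \times W \times 2 \times T}$, obtained by the data preprocessing module are taken as input and compressed into a vector $z = \{z_1, z_2, \ldots, z_t, \ldots, z_T\}$, $z \in \mathbb{R}^{T}$, using a compression function $f$; when the compression function $f$ is an average pooling function, the $t$-th tensor $X_t$ in $X$ is compressed into a value $z_t$ by:

$$z_t = \frac{1}{H \times W \times 2} \sum_{i=1}^{H} \sum_{j=1}^{W} \sum_{k=1}^{2} X_t(i, j, k)$$

second, the vector $z$ is fed into a two-layer nonlinear fully connected network to obtain an output vector $s \in \mathbb{R}^{T}$:

$$s = \sigma(W_2\,\delta(W_1 z))$$

where $\delta$ is the ReLU activation function, $\sigma$ is the Sigmoid function, the weight matrices $W_1$ and $W_2$ are trainable parameters, and $r$ is an optional hyper-parameter;

third, the value $s_t$ in the vector $s$ is used as the weight of event frame $X_t$, and each element of the event frame is multiplied by it to obtain a new event frame $\tilde{X}_t$ as the output of the input weighting unit:

$$\tilde{X}_t = s_t X_t$$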

5. The neuromorphic vision target classification system based on an input-weighted spiking neural network as claimed in claim 1, wherein the learning module comprises a feedforward network computing unit, an error back propagation unit and a weight updating unit, specifically comprising:

the feedforward network computing unit uses a random time clipping method to select $T$ event frames from the event frame sequence converted from a single event stream as the input of the feedforward network computing unit, the $T$ event frames being described by tensors; specifically, from the $T_{total}$ event frames generated at temporal resolution $dt$ from a single spatio-temporal pulse event stream, $T$ event frames are randomly extracted as the input of the feedforward network computing unit, where $T < T_{total}$;

6. The neuromorphic visual target classification system based on an input weighted impulse neural network as claimed in claim 1, wherein the inference module is composed of a model loading unit and a feedforward network computing unit; the model loading unit reads an input weighted impulse neural network structure configured by the network construction module and loads a model file generated by the learning module to obtain a trained input weighted impulse neural network model; the feedforward network computing unit is used for computing input weighting and a pulse neuron layer in the impulse neural network model according to the event stream or the event frame provided by the data preprocessing module, and then the perceptron neuron output layer obtains an output target according to output pulses of the pulse neuron layer.