CN113378917A - Event camera target identification method based on self-attention mechanism - Google Patents
- Publication number
- CN113378917A (application CN202110640443.7A)
- Authority
- CN
- China
- Prior art keywords
- event camera
- data
- self
- target
- event
- Prior art date
- Legal status: Granted (status is an assumption, not a legal conclusion)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02T—CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
- Y02T10/00—Road transport of goods or passengers
- Y02T10/10—Internal combustion engine [ICE] based vehicles
- Y02T10/40—Engine management systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Linguistics (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
A method of event camera target recognition based on a self-attention mechanism comprises the steps of: S1, initializing an event camera; S2, completing a data acquisition task with the initialized event camera; S3, performing imaging conversion on the collected event camera data so that it can be used for a target recognition task; S4, extracting features from the converted event camera data with a trained network to obtain depth features of the target; S5, inputting the extracted depth features into a self-attention model for self-attention calculation to obtain the target type contained in the event camera data at the current moment; and S6, outputting the results of the self-attention calculation and taking the result with the highest confidence as the final output. The method addresses the technical problems of insufficient event camera training data and the weak global perception of event camera data exhibited by conventional deep learning methods.
Description
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an event camera target identification method based on a self-attention mechanism.
Background
Target recognition technology is widely applied in many areas of daily life, and traditional target recognition is performed with a conventional RGB camera. Over the past decades, deep learning has matured into the dominant recognition technique. As the technology has been deployed, however, schemes based on conventional RGB cameras have shown certain defects: an RGB camera acquires external data as a series of frames, so a large amount of information is redundant between consecutive frames, while key information occurring between adjacent frames is easily lost because the frame refresh rate is relatively fixed. Meanwhile, with the popularization of conventional RGB cameras, memory and transmission bottlenecks have become more and more obvious, and deep-learning-based target recognition consumes a large amount of computing power, which degrades the real-time performance of some applications.
The event camera is a visual acquisition mode different from that of the traditional RGB camera: it is a neuromorphic vision sensor whose design was inspired by the human retina, and it captures changes in the outside world in an event-driven manner. An event camera has no frame-based update. When the external scene changes, the camera performs a series of pixel-level updates, while unchanged regions are not updated at all. Each event pixel carries four pieces of information, represented as (x, y, t, p), where (x, y) is the two-dimensional coordinate of the event pixel, t is its timestamp, and p is its polarity; the polarity reflects the brightness change at that pixel and can be either rising or falling. Because of this update mode, the data volume of an event camera is very small, and fewer resources are required for data storage and processing.
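The four-field event representation described above can be sketched directly. The code below is an illustrative aid, not part of the patent; the field types and synthetic values are assumptions:

```python
from typing import NamedTuple

class Event(NamedTuple):
    x: int    # column of the firing pixel
    y: int    # row of the firing pixel
    t: float  # timestamp of the brightness change
    p: int    # polarity: +1 for a brightness rise, -1 for a fall

# Only changing pixels emit events; static regions contribute nothing,
# which is why the data volume stays small.
stream = [Event(12, 40, 1000.0, +1), Event(13, 40, 1005.0, -1)]
rising = [e for e in stream if e.p > 0]
print(len(stream), len(rising))  # 2 1
```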
The attention mechanism is a neural network model designed after the attention of the human brain. It can effectively extract the key information during network training. By dynamically attending to particular parts of the features, the attention mechanism distinguishes local from global information hierarchically and assigns weight effectively to the parts that matter most to the result, thereby improving the accuracy of the result.
Disclosure of Invention
The invention aims to provide an event camera target recognition method based on a self-attention mechanism, solving the technical problems of insufficient event camera training data and the weak global perception of event camera data exhibited by conventional deep learning methods.
The technical scheme of the invention is as follows:
the invention discloses a method for recognizing an event camera target based on a self-attention mechanism, comprising the following steps: S1, initialization: initializing the event camera, with different data initialization schemes adopted for different event camera data types; S2, data acquisition: completing a data acquisition task with the initialized event camera; S3, data conversion: performing imaging conversion on the acquired event camera data so that it can be used for a target recognition task; S4, feature extraction: extracting features from the converted event camera data with a trained network to obtain depth features of the target; S5, self-attention calculation: inputting the extracted depth features into a self-attention model for self-attention calculation to obtain the target type contained in the event camera data at the current moment; and S6, result output: outputting the results of the self-attention calculation and taking the result with the highest confidence as the final output.
Preferably, in the above method, in step S1 the event camera is initialized in one of two ways. In the first, a moving target is to be recognized: a fixed camera collects data of the moving target, and the camera is covered before the first frame is captured so that initialization of the event camera is not disturbed by other background information. In the second, non-moving static targets are to be collected: the event camera is fixed on a moving unmanned aerial vehicle and the static targets are collected by exploiting relative motion; this configuration can also collect moving-target data.
Preferably, in the above method for recognizing an event camera target based on a self-attention mechanism, in step S2, data to be detected is collected and transmitted to the data conversion module in real time.
Preferably, in the above method, in step S3 the event camera data collected in step S2 is converted into images. The conversion intercepts and stores the cross-section of pixels sharing the same timestamp t, so that every event pixel a(x, y, t, p) of the cross-section carries the uniform time t; the pixel value is obtained by normalizing the polarity p into the range 0-255, yielding sparse image data with a clear target contour.
Preferably, in the above method, in step S4 the event camera data obtained in step S3 is input into a depth model that has been pre-trained on a large number of RGB images and then trained on a small amount of event camera data, so as to extract depth features that are effective for the target; these depth features retain rich information about the target's contour structure.
Preferably, in the above method, in step S5 the self-attention calculation first weights the depth features by multiplying them by three different weight matrices W1, W2 and W3, and then combines the weighted feature matrices pairwise to obtain the result.
According to the technical scheme of the invention, the beneficial effects are as follows:
the method utilizes an attention mechanism to identify the data collected by an event camera and identify the moving target collected by the event camera; a special characteristic training and extracting mode is designed for a special data expression form of the event camera. The method can use the minimum processing cost to perform target recognition, and then provides a pre-training method for learning structural features by using the existing RGB data aiming at the problem of insufficient training data of the event camera, so that the method for training the data of the event camera effectively improves the generalization expression capability of the model; the method has the advantages that the self-attention mechanism is utilized to carry out associated learning on the trained event camera features, and finally, the effective event camera data recognition network is trained, so that the application field of the event camera is expanded.
For a better understanding and appreciation of the concepts, principles of operation, and effects of the invention, reference will now be made in detail to the following examples, taken in conjunction with the accompanying drawings, in which:
drawings
In order to more clearly illustrate the detailed description of the invention or the technical solutions in the prior art, the drawings that are needed in the detailed description of the invention or the prior art will be briefly described below.
FIG. 1 is a flow chart of a method of event camera target recognition based on a self-attention mechanism of the present invention; and
fig. 2 is a data diagram collected by an event camera.
Detailed Description
The present invention will be described in further detail below with reference to specific embodiments in conjunction with the accompanying drawings.
The invention discloses a method for identifying an event camera target based on a self-attention mechanism, relating to a moving-target identification technology for event cameras that uses self-attention. Specifically, for the data collected by an event camera, a neural network with a self-attention mechanism performs target detection, identification and calculation, finally identifying the target object in the data. The event camera data to be recognized is input into a trained self-attention model, which performs recognition and detection effectively. The method intelligently identifies targets acquired by the event camera using self-attention, can effectively recognize the various targets the event camera captures, and improves the stability and reliability of target recognition.
The principle of the method is as follows. It builds on target recognition with a self-attention mechanism, adapted to event camera data by improving the training and feature extraction scheme. The improved strategy is to pre-train on abundant RGB images to learn the structural characteristics of target recognition, which express object contours richly, and then to retrain the learned features on a small amount of event camera data so that the network adapts to the characteristics of event camera data while retaining the strong structural features it has learned. Once effective features have been learned, the target features are learned associatively through the self-attention mechanism, finally producing a similarity score for each candidate; the target object is the one with the highest score.
As shown in fig. 1, the technique of the method of the present invention comprises the following steps:
s1, initialization: event cameras are initialized, with different data initialization schemes being employed for different event camera data types.
In this step, the event camera is initialized in one of two ways. In the first, a moving target is to be recognized: a fixed camera collects data of the moving target, and the camera is covered before the first frame is captured so that initialization of the event camera is not disturbed by other background information. In the second, non-moving static targets are to be collected: the event camera is fixed on a moving unmanned aerial vehicle and the static targets are collected by exploiting relative motion, although this configuration can also collect moving-target data.
S2, data acquisition: and completing the data acquisition task by utilizing the initialized event camera.
After the setting of step S1, data to be detected is collected in step S2, and the data is transmitted to the data conversion module in real time.
S3, data conversion: performing imaging conversion on the acquired event camera data, that is, rendering the event stream into images, so that the data can be used for a target recognition task.
The event camera data collected in step S2 is converted into images by intercepting and storing the cross-section of pixels sharing the same timestamp t. Every event pixel a(x, y, t, p) of the cross-section carries the uniform time t; the pixel value is obtained by normalizing the polarity p into the range 0-255, yielding sparse image data with a clear target contour.
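One plausible reading of this conversion can be sketched in Python. The patent only specifies a cross-section at a uniform time t and normalization into 0-255; the time-window bounds, the accumulation of repeated polarities, and the min-max normalization below are assumptions:

```python
import numpy as np

def events_to_frame(events, t0, t1, height, width):
    """Accumulate the events falling in [t0, t1) into a sparse image and
    min-max normalize the summed polarities into the 0-255 range."""
    frame = np.zeros((height, width), dtype=np.float64)
    for x, y, t, p in events:
        if t0 <= t < t1:
            frame[y, x] += p          # polarities of repeated events add up
    lo, hi = frame.min(), frame.max()
    if hi > lo:
        frame = (frame - lo) / (hi - lo) * 255.0
    return frame.astype(np.uint8)

# Synthetic events (x, y, t, p); the last one falls outside the time window.
events = [(2, 3, 0.5, +1), (2, 3, 0.7, +1), (5, 1, 0.9, -1), (0, 0, 2.0, +1)]
img = events_to_frame(events, t0=0.0, t1=1.0, height=8, width=8)
print(img[3, 2], img[1, 5], img[0, 0])  # 255 0 85
```

Pixels that fired positively saturate toward 255, negative pixels toward 0, and silent pixels take the normalized zero level, which is what gives the sparse image its clear contour.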
As shown in fig. 2, after the data of a vehicle (the target) is acquired in step S2, the imaging method of step S3 produces the visualization shown; the acquisition and processing of other targets follows the same pattern.
S4, feature extraction: and extracting the features of the event camera data after the imaging conversion by using the trained network to obtain the depth features of the target.
The event camera data obtained in step S3 is input into a depth model (i.e., the trained network) that has been pre-trained on a large number of RGB images and then trained on a small amount of event camera data, so as to extract depth features that are effective for the target; these features retain rich information about the target's contour structure.
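The two-phase feature extraction can be illustrated with a minimal stand-in. The patent does not name a specific network, so the convolution-plus-pooling pipeline below is an assumption, and the random filters are placeholders for weights that would actually come from RGB pre-training followed by event-data fine-tuning:

```python
import numpy as np

def conv2d_valid(img, kernel):
    """Plain valid-mode 2-D correlation, standing in for one layer of the
    pre-trained depth model."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

rng = np.random.default_rng(0)
# Filters learned first on RGB images and then refined on event data would
# be loaded here; random arrays serve as placeholders.
filters = [rng.standard_normal((3, 3)) for _ in range(4)]
event_frame = rng.random((16, 16))  # a converted event image from step S3
maps = [np.maximum(conv2d_valid(event_frame, k), 0.0) for k in filters]
depth_feature = np.array([m.mean() for m in maps])  # pooled descriptor
print(depth_feature.shape)  # (4,)
```

The pooled descriptor plays the role of the "depth feature" that the next step feeds into the self-attention calculation.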
S5, self-attention calculation: inputting the extracted depth features into a self-attention model for self-attention calculation, which yields the target type contained in the event camera data at the current moment.
The depth features of the target obtained in step S4 undergo the self-attention calculation: the features are first weighted by multiplying them by three different weight matrices W1, W2 and W3, and the weighted feature matrices are then combined pairwise to obtain the result.
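Weighting by three matrices followed by pairwise operations matches the standard scaled dot-product self-attention, sketched below under that assumption; the softmax and the scaling factor are conventional details the patent does not state:

```python
import numpy as np

def self_attention(features, w1, w2, w3):
    """Weight the features with three matrices and combine the weighted
    matrices pairwise via a row-wise softmax over their similarities."""
    q, k, v = features @ w1, features @ w2, features @ w3
    scores = q @ k.T / np.sqrt(k.shape[1])            # pairwise similarities
    scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    weights = scores / scores.sum(axis=1, keepdims=True)
    return weights @ v                                # attention-weighted mix

rng = np.random.default_rng(1)
feats = rng.standard_normal((5, 16))                  # 5 depth-feature vectors
w1, w2, w3 = (rng.standard_normal((16, 16)) for _ in range(3))
out = self_attention(feats, w1, w2, w3)
print(out.shape)  # (5, 16)
```

Because every feature is compared with every other feature, each output row mixes in global information, and the pairwise products are independent of one another, which is what allows the parallel processing mentioned in claim 6.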
S6, outputting a result: and outputting the result of the feature subjected to the self-attention calculation, and taking the result with the highest confidence as a final result for outputting.
The method applies the self-attention mechanism to global information to solve the recognition problem for event camera data. It designs a complete pipeline covering the acquisition mode of event camera data, the imaging processing of the event camera, the feature learning of the self-attention depth model, and training on a small amount of event camera data, effectively solving the problems that an event camera otherwise cannot recognize targets and that a small amount of event camera data cannot be trained effectively.
The foregoing description is of the preferred embodiment of the concepts and principles of operation in accordance with the invention. The above-described embodiments should not be construed as limiting the scope of the claims, and other embodiments and combinations of implementations according to the inventive concept are within the scope of the invention.
Claims (6)
1. A method for event camera target recognition based on a self-attention mechanism, comprising the steps of:
s1, initialization: initializing the event camera, wherein different data initialization schemes are adopted for different event camera data types;
s2, data acquisition: completing a data acquisition task by utilizing the initialized event camera;
s3, data conversion: carrying out imaging conversion on the acquired event camera data so that the event camera data can be used for a target recognition task;
s4, feature extraction: extracting features of the event camera data after the imaging conversion by using a trained network to obtain depth features of the target;
s5, self-attention calculation: inputting the extracted depth features into a self-attention mechanism model for self-attention calculation to obtain a target type contained in the event camera data at the current moment; and
s6, outputting a result: and outputting the result of the feature subjected to the self-attention calculation, and taking the result with the highest confidence as a final result for outputting.
2. The method for event camera target recognition based on the self-attention mechanism as claimed in claim 1, wherein in step S1 the event camera is initialized in one of two ways: in the first, a moving target is to be recognized, a fixed camera collects data of the moving target, and the camera is covered before the first frame is captured so that initialization of the event camera is not disturbed by other background information; in the second, non-moving static targets are collected, the event camera is fixed on a moving unmanned aerial vehicle, and the static targets are collected by exploiting relative motion, an operation that can also collect moving-target data.
3. The method for event camera target recognition based on self-attention mechanism as claimed in claim 1, wherein in step S2, the data to be detected is collected and transmitted to the data conversion module in real time.
4. The method for event camera target recognition based on the self-attention mechanism as claimed in claim 1, wherein in step S3 the event camera data collected in step S2 is converted into images by intercepting and storing the cross-section of pixels sharing the same timestamp t, so that every event pixel a(x, y, t, p) of the cross-section carries the uniform time t, the pixel value being obtained by normalizing the polarity p into the range 0-255, yielding sparse image data with a clear target contour.
5. The method for event camera target recognition based on the self-attention mechanism as claimed in claim 1, wherein in step S4 the event camera data obtained in step S3 is input into a depth model pre-trained on a large number of RGB images and then trained on a small amount of event camera data, so as to extract depth features that are effective for the target, the depth features retaining rich target contour structure information.
6. The method for event camera target recognition based on the self-attention mechanism as claimed in claim 1, wherein in step S5 the self-attention calculation first weights the depth features by multiplying them by three different weight matrices W1, W2 and W3, and then combines the weighted feature matrices pairwise to obtain the result; because it operates pairwise, the self-attention mechanism effectively captures global information and can be processed in parallel.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110640443.7A (granted as CN113378917B) | 2021-06-09 | 2021-06-09 | Event camera target recognition method based on self-attention mechanism |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110640443.7A (granted as CN113378917B) | 2021-06-09 | 2021-06-09 | Event camera target recognition method based on self-attention mechanism |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113378917A | 2021-09-10 |
| CN113378917B | 2023-06-09 |
Family
ID=77572905
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202110640443.7A (granted as CN113378917B, active) | Event camera target recognition method based on self-attention mechanism | 2021-06-09 | 2021-06-09 |
Country Status (1)
| Country | Link |
|---|---|
| CN | CN113378917B |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114140656A (en) * | 2022-02-07 | 2022-03-04 | 中船(浙江)海洋科技有限公司 | Marine ship target identification method based on event camera |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110942518A (en) * | 2018-09-24 | 2020-03-31 | 苹果公司 | Contextual computer-generated reality (CGR) digital assistant |
CN111766939A (en) * | 2019-03-15 | 2020-10-13 | 苹果公司 | Attention direction on optical transmission display |
CN112686928A (en) * | 2021-01-07 | 2021-04-20 | 大连理工大学 | Moving target visual tracking method based on multi-source information fusion |
- 2021-06-09: CN application CN202110640443.7A, granted as patent CN113378917B (active)
Non-Patent Citations (3)
| Title |
|---|
| J. Yang et al., "Modeling point clouds with self-attention and Gumbel subset sampling", CVPR |
| Qiu Zhongyu, "Research on target detection and recognition algorithms based on dynamic vision sensors", Information Science & Technology series |
| Min Yonghao, "Research on source camera model identification algorithms based on attention mechanism", Wanfang |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114140656A (en) * | 2022-02-07 | 2022-03-04 | 中船(浙江)海洋科技有限公司 | Marine ship target identification method based on event camera |
CN114140656B (en) * | 2022-02-07 | 2022-07-12 | 中船(浙江)海洋科技有限公司 | Marine ship target identification method based on event camera |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113378917B | 2023-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110135319B (en) | Abnormal behavior detection method and system | |
CN108460356B (en) | Face image automatic processing system based on monitoring system | |
CN111460968B (en) | Unmanned aerial vehicle identification and tracking method and device based on video | |
CN109919977B (en) | Video motion person tracking and identity recognition method based on time characteristics | |
CN105160310A (en) | 3D (three-dimensional) convolutional neural network based human body behavior recognition method | |
CN103729614A (en) | People recognition method and device based on video images | |
CN110555420B (en) | Fusion model network and method based on pedestrian regional feature extraction and re-identification | |
CN107392131A (en) | A kind of action identification method based on skeleton nodal distance | |
CN111582095B (en) | Light-weight rapid detection method for abnormal behaviors of pedestrians | |
CN112016402B (en) | Self-adaptive method and device for pedestrian re-recognition field based on unsupervised learning | |
CN110390308B (en) | Video behavior identification method based on space-time confrontation generation network | |
CN111639580A (en) | Gait recognition method combining feature separation model and visual angle conversion model | |
CN113111758A (en) | SAR image ship target identification method based on pulse neural network | |
CN110135435B (en) | Saliency detection method and device based on breadth learning system | |
CN115188066A (en) | Moving target detection system and method based on cooperative attention and multi-scale fusion | |
CN113378917B (en) | Event camera target recognition method based on self-attention mechanism | |
CN111881818B (en) | Medical action fine-grained recognition device and computer-readable storage medium | |
CN113901931A (en) | Knowledge distillation model-based behavior recognition method for infrared and visible light videos | |
CN110633631B (en) | Pedestrian re-identification method based on component power set and multi-scale features | |
CN112487926A (en) | Scenic spot feeding behavior identification method based on space-time diagram convolutional network | |
CN110334703B (en) | Ship detection and identification method in day and night image | |
CN110555406B (en) | Video moving target identification method based on Haar-like characteristics and CNN matching | |
CN111950476A (en) | Deep learning-based automatic river channel ship identification method in complex environment | |
CN110852214A (en) | Light-weight face recognition method facing edge calculation | |
CN113869151A (en) | Cross-view gait recognition method and system based on feature fusion |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||