CN114862732B - Synthetic aperture imaging method integrating event camera and traditional optical camera - Google Patents
- Publication number
- CN114862732B CN114862732B CN202210422694.2A CN202210422694A CN114862732B CN 114862732 B CN114862732 B CN 114862732B CN 202210422694 A CN202210422694 A CN 202210422694A CN 114862732 B CN114862732 B CN 114862732B
- Authority
- CN
- China
- Prior art keywords
- event
- image
- camera
- synthetic aperture
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Abstract
The invention provides a synthetic aperture imaging method that fuses an event camera with a traditional optical camera, combining the advantages of the two cameras for synthetic aperture imaging. By constructing a hybrid neural network architecture based on a spiking neural network and a convolutional neural network, the method builds a bridge between the event stream and the image frames and reconstructs a high-quality, occlusion-free image of the target, accomplishing high-quality see-through imaging in occlusion scenes of various densities. The invention comprehensively exploits the complementary advantages of the event camera and the traditional optical camera, together with the strong learning ability of neural networks, thereby achieving high-quality image reconstruction of targets behind occlusions of various densities and enhancing the robustness and applicability of synthetic aperture imaging technology.
Description
Technical Field
The invention belongs to the field of image processing, and particularly relates to a synthetic aperture imaging method that fuses an event camera with a traditional optical camera.
Background
Synthetic aperture imaging (Synthetic Aperture Imaging, SAI) techniques use cameras to observe a scene from multiple viewpoints, which is equivalent to a single virtual camera with a large aperture. Because a larger aperture yields a smaller depth of field, synthetic aperture imaging can blur out an occluder and image the occluded target behind it, which gives the technique high application value in fields such as three-dimensional reconstruction, target tracking, and recognition.
Current synthetic aperture imaging algorithms mainly use a sequence of multi-view image frames captured by a traditional optical camera for see-through imaging. However, as the occlusion becomes denser, the target light information contained in the image frames decreases dramatically and the interfering light from the occluder dominates, causing serious performance degradation of synthetic aperture imaging algorithms based on traditional optical cameras. In recent years, synthetic aperture imaging algorithms based on event cameras have been proposed to solve the see-through imaging problem in densely occluded scenes. An event camera senses pixel brightness changes in the logarithmic domain and asynchronously outputs event-stream data with high temporal resolution and high dynamic range; it can perceive the target almost continuously and therefore obtains sufficient target information in densely occluded scenes. However, because existing event-camera synthetic aperture imaging algorithms image from event points generated by the brightness difference between the target and the occluder, the number of effective event points drops in sparsely occluded scenes and performance degrades accordingly. Considering that the algorithms based on the traditional optical camera and on the event camera perform better in sparsely and densely occluded scenes respectively, the information from both can be fully exploited for fused imaging, achieving synthetic aperture imaging across occlusion densities. However, since the data modality of the event stream is completely different from that of image frame data, building a bridge between the two for fused imaging remains very difficult.
Disclosure of Invention
Aiming at these problems, the invention provides a synthetic aperture imaging method based on an event camera and a traditional optical camera, which combines the advantages of the two cameras for synthetic aperture imaging, builds a bridge between the event stream and the image frames by constructing a neural network architecture based on a spiking neural network and a convolutional neural network, and reconstructs a high-quality occlusion-free image of the target, accomplishing high-quality see-through imaging in occluded scenes of various densities.
The invention provides a synthetic aperture imaging method based on an event camera and a traditional optical camera, which comprises the following specific steps:
step 1, constructing a multi-view event stream and an image frame data set in an occlusion scene, and constructing the image frame data set in a non-occlusion scene.
Step 2, selecting a reference camera position, mapping a multi-view event stream and an image frame to the reference camera position according to a multi-view geometric principle, and refocusing an occluded target to obtain a refocused event stream and an image frame data set;
Step 3, constructing a hybrid neural network model: compress the refocused event stream into event frames; pre-reconstruct the unfocused event stream into pre-reconstructed event frames and refocus them; input the event frames, the pre-reconstructed event frames, and the refocused image frame data set into the hybrid neural network as the training set to obtain the network-predicted target reconstruction image; and, combining the occlusion-free image of the target with the network learning loss function, iteratively optimize the network parameters with the ADAM optimizer to obtain the trained hybrid neural network;
the hybrid neural network model described in step 3 includes the following modules: a multi-modal encoding module, a cross-attention enhancement module, a density-aware module, and a multi-modal decoding module;
the multi-modal encoding module comprises several convolutional layers or spiking layers and is used for feature extraction;
the cross-attention enhancement module comprises several cross-attention Transformer modules for repeatedly enhancing the extracted features;
the density-aware module concatenates the enhanced features with the original features, then applies a convolutional layer, global average pooling, global max pooling, and a fully connected layer, and finally outputs the fused features by weighted summation;
the multi-modal decoding module is an existing convolutional neural network architecture used for reconstructing the image from the fused features;
and step 4, inputting the event stream and image frames of the occluded target to be reconstructed into the trained network for prediction, obtaining the de-occluded reconstruction image corresponding to the occluded target.
Further, the multi-view event stream data set E in step 1 is:

E = {e_k = (x_k, y_k, p_k, t_k)}, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, −1}

where e_k is the k-th event point, x_k, y_k are its pixel coordinates, p_k is the event polarity (polarity 1 indicates a brightness increase and polarity −1 a brightness decrease), t_k is its timestamp, W and H are the width and height of the event coordinate space, and K is the number of event points.
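As an illustrative sketch (not part of the patent itself), such an event stream maps naturally onto a NumPy structured array; the 346×260 resolution matches the DAVIS346 sensor used later in the embodiment, while the random events here are purely synthetic:

```python
import numpy as np

# Event stream E = {e_k = (x_k, y_k, p_k, t_k)} as a structured array.
W, H, K = 346, 260, 1000
rng = np.random.default_rng(0)

events = np.zeros(K, dtype=[("x", "u2"), ("y", "u2"), ("p", "i1"), ("t", "f8")])
events["x"] = rng.integers(0, W, K)
events["y"] = rng.integers(0, H, K)
events["p"] = rng.choice([-1, 1], K)             # polarity: +1 brightness up, -1 down
events["t"] = np.sort(rng.uniform(0.0, 1.0, K))  # timestamps, monotonically ordered

on_events = events[events["p"] == 1]             # e.g. keep only "brightness up" events
```

This layout keeps each event point contiguous in memory and makes polarity or time-window filtering a one-line boolean mask.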
Further, the multi-view image frame data set F in step 1 is:

F = {I_n}, n ∈ [1, N], I_n ∈ R^(W×H)

where I_n denotes the n-th image frame, W and H denote the width and height of an image frame, and N denotes the total number of image frames.

Further, the mapping of the multi-view event stream and image frames to the reference camera position in step 2 is specifically:
[x_r, y_r, 1]^T = K R K^(−1) [x, y, 1]^T + K T / d  (1)

In formula (1), (x_r, y_r) are the mapped image coordinates, (x, y) are the original image coordinates, R and T are the rotation and translation matrices from the pixel's camera position to the reference camera position, K is the intrinsic matrix of the camera, and d is the refocusing distance, generally set to the distance from the occluded target to the camera plane. From this mapping formula, the refocused event stream data set E_r is obtained:

E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]

where e_k^r is the k-th refocused event point, x_k^r, y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (polarity 1 indicates a brightness increase and polarity −1 a brightness decrease), and t_k^r is its timestamp. Likewise, from mapping formula (1) the refocused image frame data set F_r is obtained:

F_r = {I_n^r}, n ∈ [1, N]

where I_n^r denotes the n-th refocused image frame.
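The refocusing map of formula (1) can be sketched as follows for the uniform-linear-motion case used later in the embodiment (R = I, T = [t_x, 0, 0]^T). The intrinsic parameters, timestamps, and refocus depth below are illustrative assumptions; only the 0.0885 m/s rail speed comes from the embodiment:

```python
import numpy as np

# [x_r, y_r, 1]^T = K R K^{-1} [x, y, 1]^T + K T / d, with no rotation (linear slide).
Kmat = np.array([[320.0, 0.0, 173.0],
                 [0.0, 320.0, 130.0],
                 [0.0,   0.0,   1.0]])    # assumed camera intrinsics
R = np.eye(3)                             # the camera does not rotate on the rail
v_track, t, t_r = 0.0885, 0.5, 0.3        # rail speed (m/s), current and reference times (s)
T = np.array([v_track * (t - t_r), 0.0, 0.0])
d = 1.2                                   # refocus depth: assumed target-to-camera distance (m)

def refocus(x, y):
    """Map pixel (x, y) of the current view to the reference camera position."""
    xr, yr, w = Kmat @ R @ np.linalg.inv(Kmat) @ np.array([x, y, 1.0]) + (Kmat @ T) / d
    return xr / w, yr / w

xr, yr = refocus(100.0, 50.0)
# With R = I the map reduces to a pure horizontal shift of f_x * T_x / d pixels.
```

The same map is applied to event coordinates and to every pixel of each image frame, which is why a single homography-plus-shift suffices per timestamp.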
Further, the refocused event stream framing (frame compression) procedure described in step 3 is expressed as:

I_j^(E,r)(x, y) = Σ_{t_k^r ∈ [t_1 + (j−1)Δt, t_1 + jΔt)} p_k^r · δ(x − x_k^r, y − y_k^r), j ∈ [1, J]  (2)

In formula (2), I_j^(E,r) is the j-th compressed event frame, J is the total number of event frames, (x, y) are image coordinates, (x_k^r, y_k^r) are the image coordinates of the k-th refocused event point, δ(·) is the Dirac delta function, and Δt is the time span covered by a single event frame, computed as:

Δt = (t_K − t_1) / J

where t_K is the timestamp of the last (K-th) event point and t_1 is the timestamp of the first. This yields the compressed refocused event frame data set:

F_(E,r) = {I_j^(E,r)}, j ∈ [1, J]
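The frame-compression step of formula (2) amounts to binning each event's polarity into the frame covering its timestamp. A minimal sketch (event data, sensor size, and J = 4 are illustrative assumptions):

```python
import numpy as np

def events_to_frames(x, y, p, t, W, H, J):
    """Accumulate event polarities into J event frames of equal duration dt = (t_K - t_1)/J."""
    frames = np.zeros((J, H, W))
    dt = (t[-1] - t[0]) / J
    j = np.minimum(((t - t[0]) / dt).astype(int), J - 1)  # frame index per event
    np.add.at(frames, (j, y, x), p)   # I_j(x, y) += p_k  (the discrete Dirac delta)
    return frames

W, H, J = 346, 260, 4
rng = np.random.default_rng(1)
x = rng.integers(0, W, 500); y = rng.integers(0, H, 500)
p = rng.choice([-1, 1], 500); t = np.sort(rng.uniform(0.0, 1.0, 500))
F_E = events_to_frames(x, y, p, t, W, H, J)
```

`np.add.at` is used instead of `frames[j, y, x] += p` because it accumulates correctly when several events land on the same pixel of the same frame.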
Further, the event stream pre-reconstruction process described in step 3 is expressed as:

I_j^(E→F) = Recon({e_k : t_k ∈ [t_1 + (j−1)Δt, t_1 + jΔt)}), j ∈ [1, J]  (3)

In formula (3), I_j^(E→F) is the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, Recon(·) is an event-stream brightness reconstruction operator, for which a currently mainstream event-stream reconstruction algorithm is generally used, and Δt is the time span covered by a single event frame. Next, using mapping formula (1), each image I_j^(E→F) is mapped to the reference camera position, giving the refocused pre-reconstructed event frame data set:

F_(E→F,r) = {I_j^(E→F,r)}, j ∈ [1, J]

where I_j^(E→F,r) is the j-th refocused pre-reconstructed event frame.
Further, the hybrid neural network model in step 3 includes the following modules: a multi-modal encoding module, a cross-attention enhancement module, a density-aware module, and a multi-modal decoding module. The network inputs, namely the compressed refocused event frames F_(E,r), the refocused image frames F_r, and the refocused pre-reconstructed event frames F_(E→F,r), are first fed through three branches into the multi-modal encoding module for coarse feature extraction:

f_(F,0), f_(E,0), f_(E→F,0) = MF-Encoder(F_r, F_(E,r), F_(E→F,r))

where f_(F,0), f_(E,0), f_(E→F,0) are the coarse features extracted from F_r, F_(E,r), F_(E→F,r), and MF-Encoder(·) is the multi-modal encoding operation, which extracts features from the three signals separately. For the two signals F_r and F_(E→F,r), feature extraction uses a three-layer convolutional architecture with skip connections. For the F_(E,r) signal, feature extraction uses three spiking layers with skip connections.
Next, the features are enhanced with the cross-attention enhancement module:

f_(F,M), f_(E,M), f_(E→F,M) = CME(f_(F,0), f_(E,0), f_(E→F,0))

where f_(F,0), f_(E,0), f_(E→F,0) are the three-branch features output by the multi-modal encoding module, f_(F,M), f_(E,M), f_(E→F,M) are the three-branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which enhances the features repeatedly through M cross-attention Transformer modules:

f_(F,m), f_(E,m), f_(E→F,m) = C-Transformer(f_(F,m−1), f_(E,m−1), f_(E→F,m−1)), m ∈ [1, M]

where f_(F,m−1), f_(E,m−1), f_(E→F,m−1) are the three-branch features input to the m-th cross-attention Transformer module and f_(F,m), f_(E,m), f_(E→F,m) are the output features. C-Transformer(·) denotes the cross-attention Transformer feature enhancement, which first normalizes each branch and computes its attention information, then adds and fuses the attention information of the three branches to achieve cross-modal information enhancement, and finally enhances the features with a multi-layer perceptron, outputting the enhanced features f_(F,m), f_(E,m), f_(E→F,m).
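The attention-summing fusion described above can be sketched in a deliberately simplified form: a single attention head with shared, randomly initialized weights, where each branch's self-attention output is summed across branches before a per-branch MLP. Token count, channel width, the single head, and all weights are illustrative assumptions, not the patent's architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(f, eps=1e-5):
    return (f - f.mean(-1, keepdims=True)) / (f.std(-1, keepdims=True) + eps)

def attention(f, Wq, Wk, Wv):
    q, k, v = f @ Wq, f @ Wk, f @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(2)
N, C = 16, 32                                  # 16 tokens, 32 channels per branch
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
W1, W2 = rng.standard_normal((C, C)) * 0.1, rng.standard_normal((C, C)) * 0.1

branches = [rng.standard_normal((N, C)) for _ in range(3)]    # f_F, f_E, f_E->F
attn_sum = sum(attention(layer_norm(f), Wq, Wk, Wv) for f in branches)
enhanced = [f + attn_sum for f in branches]                   # shared cross-modal term
enhanced = [f + np.maximum(f @ W1, 0) @ W2 for f in enhanced] # per-branch MLP (ReLU)
```

Summing the three attention outputs is what injects each modality's information into the other two branches; the residual connections keep each branch's own signal intact.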
Next, the density-aware module performs weighted fusion of the three-branch features:

f_ALL = DAF(f_(F,M), f_(E,M), f_(E→F,M), F_r, F_(E,r), F_(E→F,r))

where f_(F,M), f_(E,M), f_(E→F,M) are the three enhanced features from the previous step, F_r, F_(E,r), F_(E→F,r) are the three original network inputs, f_ALL is the fused output feature, and DAF(·) is the density-aware fusion operation. It concatenates each input feature with its corresponding original input along the feature dimension and feeds the result into a single convolutional layer for feature extraction; global average pooling and global max pooling then extract the mean and maximum of each of the three features, which are fed into a fully connected layer to compute a weight; finally, the three signals are weighted and summed to output the fused feature f_ALL.
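The pooling-and-weighting core of the fusion can be sketched as follows; the convolution/concatenation front end is omitted, and the fully connected layer is a single random vector here, so everything except the GAP/GMP-then-weight structure is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
C, H, W = 32, 16, 16
feats = [rng.standard_normal((C, H, W)) for _ in range(3)]    # f_F, f_E, f_E->F stand-ins

W_fc = rng.standard_normal(2 * C) * 0.1                       # assumed FC weights
# GAP ++ GMP descriptor per branch -> scalar score via the FC layer.
scores = np.array([W_fc @ np.concatenate([f.mean((1, 2)), f.max((1, 2))])
                   for f in feats])
weights = np.exp(scores - scores.max())
weights /= weights.sum()                                      # normalized branch weights
f_all = sum(w * f for w, f in zip(weights, feats))            # density-weighted fusion
```

The learned weights let the network lean on the image-frame branch in sparse occlusion and on the event branches in dense occlusion, which is the point of making the fusion density-aware.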
Finally, the multi-modal decoding module decodes the features and outputs the final brightness reconstruction image:

I_recon = MF-Decoder(f_ALL)

where f_ALL is the fused feature output by the previous module, I_recon is the reconstructed brightness image, and MF-Decoder(·) is the multi-modal feature decoding operation, which generally uses a mainstream convolutional neural network architecture.
Further, the learning loss function L in step 3 is defined as:

L(I_recon, I_gt) = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)  (4)

where I_recon in formula (4) is the brightness image reconstructed by the network, I_gt is the corresponding ground-truth target image in the data set, L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss; these loss functions are all commonly used known functions, and β_per, β_L1, β_tv are the weights of the corresponding losses.
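The L1 and total-variation terms of formula (4) are simple enough to sketch directly; the perceptual term requires a pretrained feature network (such as VGG) and is omitted here, and the random images are purely illustrative:

```python
import numpy as np

def l1_loss(recon, gt):
    """Mean absolute difference between reconstruction and ground truth."""
    return np.abs(recon - gt).mean()

def tv_loss(img):
    """Total variation: mean absolute difference between neighboring pixels."""
    dx = np.abs(np.diff(img, axis=1)).mean()
    dy = np.abs(np.diff(img, axis=0)).mean()
    return dx + dy

rng = np.random.default_rng(4)
I_recon, I_gt = rng.random((64, 64)), rng.random((64, 64))
beta_l1, beta_tv = 32.0, 0.0002        # weights from the embodiment (perceptual term omitted)
loss = beta_l1 * l1_loss(I_recon, I_gt) + beta_tv * tv_loss(I_recon)
```

The tiny TV weight acts only as a smoothness regularizer on the reconstruction, while the L1 term dominates the fit to the ground-truth image.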
Furthermore, the input event stream and image frame data in step 4 must first undergo the same refocusing process as in step 2 and the same event stream preprocessing as in step 3, and are then fed into the trained neural network to obtain the corresponding target reconstruction image.
Compared with the prior art, the invention has the advantages that:
The invention provides a synthetic aperture imaging method that fuses an event camera with a traditional optical camera. It comprehensively exploits the advantages of both cameras together with the strong learning ability of neural networks, thereby achieving high-quality image reconstruction of targets in occlusion scenes of various densities and enhancing the robustness and applicability of synthetic aperture imaging technology.
Drawings
FIG. 1 is a schematic diagram of an experimental scenario including an occluded object, an occlusion object, a programmable sled, and a camera sensor.
FIG. 2 is a flow chart of the overall process of the present invention.
Fig. 3 is a schematic diagram of a camera moving data acquisition process.
Fig. 4 is a contrast diagram of an image frame, an event frame and a pre-reconstruction event frame after data preprocessing.
Fig. 5 is a schematic diagram of a hybrid neural network structure.
FIG. 6 compares the present method with different synthetic aperture imaging methods. Each row, from top to bottom, shows results for a densely occluded scene. From left to right: the first column is the occluder image, the second column is the occlusion-free reference image, the third column is a synthetic aperture imaging algorithm based on a traditional optical camera (F-SAI+ACC), the fourth column is a synthetic aperture imaging algorithm based on a traditional optical camera and a convolutional neural network (F-SAI+CNN), the fifth column is a synthetic aperture imaging algorithm based on an event camera and an accumulation method (E-SAI+ACC), the sixth column is a synthetic aperture imaging algorithm based on an event camera and a hybrid neural network (E-SAI+Hybrid), and the seventh column is the algorithm of the invention.
Detailed Description
For a clearer understanding of the present invention, the technical content of the invention is described more clearly and completely below through an example in conjunction with FIG. 1. The described example is obviously only one of the embodiments of the invention, not all of them. All other examples obtained by a person of ordinary skill in the art based on the examples in this invention, without inventive effort, fall within the scope of the invention.
The specific embodiment of the invention is a synthetic aperture imaging method that fuses an event camera with a traditional optical camera.
A schematic scene of a specific implementation of the invention is shown in FIG. 1 of the accompanying drawings, and comprises: a camera sensor, a programmable slide rail, an occluder, and an occluded target;
The camera sensor is a DAVIS346 event camera, which synchronously outputs event stream and image frame data and is used to construct the corresponding data sets;
the DAVIS346 event camera is fixed on a programmable slide rail set to move linearly at a constant speed;
the whole flow chart of the invention is shown in figure 2 of the accompanying drawings, and the specific steps are as follows:
step 1, constructing a multi-view event stream and an image frame data set in an occlusion scene, and constructing an image frame data set in a non-occlusion scene, as shown in fig. 3;
The multi-view event stream data E described in step 1 is:

E = {e_k = (x_k, y_k, p_k, t_k)}, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, −1}

where e_k is the k-th event point, x_k, y_k are its pixel coordinates, p_k is the event polarity (polarity 1 indicates a brightness increase and polarity −1 a brightness decrease), t_k is its timestamp, W = 346 and H = 260 are the width and height of the event coordinate space, and K is the number of event points.
The multi-view image frame data F in step 1 is:

F = {I_n}, n ∈ [1, N], I_n ∈ R^(W×H)

where I_n is the n-th image frame, W = 346 and H = 260 are the width and height of the image frames, and N = 30 is the total number of image frames.
Step 2, selecting a reference camera position, mapping a multi-view event stream and an image frame to the reference camera position according to a multi-view geometric principle, and refocusing an occluded target to obtain a refocused event stream and an image frame data set;
The mapping of the multi-view event stream and image frames to the reference camera position in step 2 is specifically:

[x_r, y_r, 1]^T = K R K^(−1) [x, y, 1]^T + K T / d  (1)

In formula (1), (x_r, y_r) are the mapped image coordinates, R and T are the rotation and translation matrices from the pixel's camera position to the reference camera position, K is the intrinsic matrix of the camera, and d is the focusing distance, generally set to the distance from the occluded target to the camera plane. In this example, since the camera is set to move linearly at a constant speed, the camera can be considered not to rotate during shooting, so the rotation matrix is the identity matrix. At time t, the translation matrix can be modeled as

T_t = [v_track(t − t_r), 0, 0]^T

where v_track = 0.0885 m/s is the movement speed of the rail and t_r is the timestamp at which the camera passes the reference position in each capture.
From mapping formula (1), the refocused event stream data E_r is obtained:

E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]

where e_k^r is the k-th refocused event point, x_k^r, y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (polarity 1 indicates a brightness increase and polarity −1 a brightness decrease), and t_k^r is its timestamp. Likewise, from mapping formula (1) the refocused image frame data set F_r is obtained:

F_r = {I_n^r}, n ∈ [1, N]

where I_n^r is the n-th refocused image frame.
Step 3, constructing a hybrid neural network model: compress the refocused event stream into event frames; pre-reconstruct the unfocused event stream into pre-reconstructed event frames and refocus them; input the event frames, the pre-reconstructed event frames, and the refocused image frame data set into the hybrid neural network as the training set to obtain the network-predicted target reconstruction image; and, combining the occlusion-free image of the target with the network learning loss function, iteratively optimize the network parameters with the ADAM optimizer to obtain the trained hybrid neural network;
The refocused event stream framing (frame compression) procedure described in step 3 is expressed as:

I_j^(E,r)(x, y) = Σ_{t_k^r ∈ [t_1 + (j−1)Δt, t_1 + jΔt)} p_k^r · δ(x − x_k^r, y − y_k^r), j ∈ [1, J]  (2)

In formula (2), I_j^(E,r) is the j-th compressed event frame, J is the total number of event frames, (x, y) are image coordinates, (x_k^r, y_k^r) are the image coordinates of the k-th refocused event point, δ(·) is the Dirac delta function, and Δt is the time span covered by a single event frame, computed as:

Δt = (t_K − t_1) / J

where t_K is the timestamp of the last (K-th) event point and t_1 is the timestamp of the first. This yields the compressed refocused event frame data set:

F_(E,r) = {I_j^(E,r)}, j ∈ [1, J]
The process of pre-reconstructing and refocusing the event stream described in step 3 is expressed as:

I_j^(E→F) = Recon({e_k : t_k ∈ [t_1 + (j−1)Δt, t_1 + jΔt)}), j ∈ [1, J]  (3)

In formula (3), I_j^(E→F) is the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, Recon(·) is the event-stream brightness reconstruction operator, for which this example uses the existing E2VID algorithm, and Δt is the time span covered by a single event frame. Next, using mapping formula (1), each image I_j^(E→F) is mapped to the reference camera position, giving the refocused pre-reconstructed event frame data set:

F_(E→F,r) = {I_j^(E→F,r)}, j ∈ [1, J]

where I_j^(E→F,r) is the j-th refocused pre-reconstructed event frame.
The refocused image frames, event frames, and pre-reconstructed event frames are each visually illustrated in fig. 4 of the drawings.
The structure of the hybrid neural network in step 3 is shown in FIG. 5 of the accompanying drawings. The hybrid neural network model described in step 3 includes the following modules: a multi-modal encoding module, a cross-attention enhancement module, a density-aware module, and a multi-modal decoding module. The network inputs, namely the compressed refocused event frames F_(E,r), the refocused image frames F_r, and the refocused pre-reconstructed event frames F_(E→F,r), are fed through three branches into the multi-modal encoding module for coarse feature extraction:

f_(F,0), f_(E,0), f_(E→F,0) = MF-Encoder(F_r, F_(E,r), F_(E→F,r))

where f_(F,0), f_(E,0), f_(E→F,0) are the coarse features extracted from F_r, F_(E,r), F_(E→F,r), and MF-Encoder(·) is the multi-modal encoding operation, which extracts features from the three signals separately. For the two signals F_r and F_(E→F,r), feature extraction uses a three-layer convolutional architecture with skip connections; in this embodiment the kernel sizes of the three convolutional layers are 3, 5, and 7, with 8, 16, and 32 kernels respectively. For the F_(E,r) signal, feature extraction uses three spiking layers with skip connections; in this embodiment the kernel sizes of the three spiking layers are 1, 3, and 7, with 8, 16, and 32 kernels respectively.
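The spiking branch above processes event frames with spiking layers. A minimal leaky integrate-and-fire (LIF) neuron sketch illustrates the basic mechanism; the decay factor, threshold, and random inputs are illustrative assumptions, and real spiking layers additionally learn convolutional weights:

```python
import numpy as np

def lif(inputs, decay=0.8, v_th=1.0):
    """inputs: (T, N) input current per time step; returns (T, N) binary spikes."""
    v = np.zeros(inputs.shape[1])
    spikes = np.zeros_like(inputs)
    for step, x in enumerate(inputs):
        v = decay * v + x                        # leaky integration of membrane potential
        spikes[step] = (v >= v_th)               # fire when the threshold is crossed
        v = np.where(spikes[step] > 0, 0.0, v)   # hard reset after a spike
    return spikes

rng = np.random.default_rng(5)
s = lif(rng.random((10, 4)))                     # 10 time steps, 4 neurons
```

The stateful integration over time steps is what lets the spiking branch exploit the fine temporal structure of the event frames, unlike the stateless convolutional branches.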
Next, the features are enhanced with the cross-attention enhancement module:

f_(F,M), f_(E,M), f_(E→F,M) = CME(f_(F,0), f_(E,0), f_(E→F,0))

where f_(F,0), f_(E,0), f_(E→F,0) are the three-branch features output by the multi-modal encoding module, f_(F,M), f_(E,M), f_(E→F,M) are the three-branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which enhances the features repeatedly through M = 6 cross-attention Transformer modules:

f_(F,m), f_(E,m), f_(E→F,m) = C-Transformer(f_(F,m−1), f_(E,m−1), f_(E→F,m−1)), m ∈ [1, M]

where f_(F,m−1), f_(E,m−1), f_(E→F,m−1) are the three-branch features input to the m-th cross-attention Transformer module and f_(F,m), f_(E,m), f_(E→F,m) are the output features. C-Transformer(·) denotes the cross-attention Transformer feature enhancement, which first normalizes each branch and computes its attention information, then adds and fuses the attention information of the three branches to achieve cross-modal information enhancement, and finally enhances the features with a multi-layer perceptron, outputting the enhanced features f_(F,m), f_(E,m), f_(E→F,m).
Next, the density-aware module performs weighted fusion of the three-branch features:

f_ALL = DAF(f_(F,M), f_(E,M), f_(E→F,M), F_r, F_(E,r), F_(E→F,r))

where f_(F,M), f_(E,M), f_(E→F,M) are the three enhanced features from the previous step, F_r, F_(E,r), F_(E→F,r) are the three original network inputs, f_ALL is the fused output feature, and DAF(·) is the density-aware fusion operation. It concatenates each input feature with its corresponding original input along the feature dimension and feeds the result into a single convolutional layer for feature extraction; in this embodiment the kernel size is 3 and the number of kernels is 32. Global average pooling and global max pooling then extract the mean and maximum of each of the three features, which are fed into a fully connected layer to compute a weight; finally, the three signals are weighted and summed to output the fused feature f_ALL.
Finally, decoding the features by using a multi-mode decoding module, and outputting a final brightness reconstruction image:
I_{recon} = MF-Decoder(f_{ALL})
Wherein f_{ALL} is the fused feature output by the previous module, I_{recon} is the reconstructed luminance image, and MF-Decoder(·) is the multi-mode feature decoding operation; in this embodiment, the existing ResNet architecture is used as the multi-mode feature decoder.
The learning loss function of the network in the step 3 is defined as:
L = β_per·L_per(I_recon, I_gt) + β_L1·L_L1(I_recon, I_gt) + β_tv·L_tv(I_recon) (4)

In formula (4), I_recon is the luminance image reconstructed by the network, I_gt is the ground-truth image corresponding to the target image in the dataset, L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss; these are all commonly used, well-known loss functions. β_per, β_L1, β_tv denote the weights of the corresponding losses; in this example, β_per = 1, β_L1 = 32, β_tv = 0.0002.
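The L1 and total variation terms of the loss above are straightforward to compute directly; the sketch below combines them with the stated weights. The perceptual term normally requires a pretrained network (e.g. VGG features), so it is left as an optional callable rather than implemented here; this is a hedged NumPy illustration, not the patent's training code.

```python
import numpy as np

def l1_loss(pred, gt):
    # Mean absolute error between reconstruction and ground truth.
    return np.abs(pred - gt).mean()

def tv_loss(img):
    # Anisotropic total variation: mean absolute difference of neighbours.
    dx = np.abs(img[:, 1:] - img[:, :-1]).mean()
    dy = np.abs(img[1:, :] - img[:-1, :]).mean()
    return dx + dy

def total_loss(pred, gt, beta_per=1.0, beta_l1=32.0, beta_tv=2e-4,
               perceptual=None):
    """Weighted sum of the three terms in formula (4). `perceptual`
    is an optional callable (pred, gt) -> float; a zero stand-in is
    used when it is absent."""
    per = perceptual(pred, gt) if perceptual else 0.0
    return beta_per * per + beta_l1 * l1_loss(pred, gt) + beta_tv * tv_loss(pred)
```

With a perfect reconstruction the L1 and TV-of-constant terms vanish, so `total_loss(gt, gt)` returns 0.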
Step 4: input the event stream and image frames of the occluded target to be reconstructed into the trained network for prediction, obtaining the de-occluded reconstructed image corresponding to the occluded target.
The input event stream and image frame data in step 4 must first undergo the same refocusing process as in step 2 and the same event stream preprocessing as in step 3; they are then fed into the trained neural network to obtain the corresponding target reconstruction image.
Figure 6 compares the results of the algorithm of the present invention with other synthetic aperture imaging algorithms in a variety of occlusion scenes. From top to bottom are synthetic aperture imaging results for densely occluded scenes with different targets. From left to right, the first column is the occluded view, the second column is the occlusion-free reference image, the third column is a synthetic aperture imaging algorithm based on a conventional optical camera (F-SAI+ACC), the fourth column is a synthetic aperture imaging algorithm based on a conventional optical camera and a convolutional neural network (F-SAI+CNN), the fifth column is a synthetic aperture imaging algorithm based on an event camera and an accumulation method (E-SAI+ACC), the sixth column is a synthetic aperture imaging algorithm based on an event camera and a hybrid neural network (E-SAI+Hybrid), and the seventh column is the algorithm of the present invention.
It should be understood that technical portions that are not specifically set forth in the present specification are all prior art.
It should be understood that the foregoing description of the preferred embodiments is not to be construed as limiting the scope of the invention, but that substitutions and modifications can be made by one of ordinary skill in the art without departing from the scope of the invention as defined by the appended claims.
Claims (9)
1. A synthetic aperture imaging method for fusing an event camera with a conventional optical camera, comprising the steps of:
Step 1, constructing a multi-view event stream and an image frame data set in an occlusion scene, and constructing an image frame data set in a non-occlusion scene;
Step 2, selecting a reference camera position, mapping a multi-view event stream and an image frame to the reference camera position according to a multi-view geometric principle, and refocusing an occluded target to obtain a refocused event stream and an image frame data set;
Step 3, constructing a hybrid neural network model, namely compressing a refocused event stream to obtain an event frame, performing pre-reconstruction processing on an unfocused event stream to obtain a pre-reconstructed event frame, refocusing, inputting the pre-reconstructed event frame and the refocused image frame data set into the hybrid neural network model as a training set to obtain a target reconstructed image after network prediction, and iteratively optimizing network parameters through an ADAM optimizer by combining a non-occlusion image of the target and a network learning loss function to obtain a trained hybrid neural network model;
the hybrid neural network model described in step 3 includes the following modules: a multi-mode encoding module, a cross-attention enhancement module, a density sensing module and a multi-mode decoding module;
the multi-mode coding module comprises a plurality of convolution layers or pulse layers and is used for extracting features;
The cross-attention enhancement module comprises a plurality of cross-attention Transformer modules for enhancing the extracted features multiple times;
the density sensing module comprises a feature cascade of enhanced features and original features, then a convolution layer, global average pooling, global maximum pooling and a full connection layer are adopted, and finally weighted summation is carried out to output the fused features;
the multi-mode decoding module is an existing convolutional neural network architecture and is used for reconstructing images according to the fused characteristics;
Step 4, inputting the shielded target event stream to be reconstructed and the image frame into a trained hybrid neural network model for prediction to obtain a de-shielded reconstructed image corresponding to the shielded target;
the refocused event frame F_{E,r} after frame compression, the refocused image frame F_r, and the pre-reconstructed event frame F_{E→F,r} of the refocused event stream, which are input to the hybrid neural network model, are fed into the multi-mode encoding module as three branches for coarse feature extraction:
f_{F,0}, f_{E,0}, f_{E→F,0} = MF-Encoder(F_r, F_{E,r}, F_{E→F,r})
Wherein f_{F,0}, f_{E,0}, f_{E→F,0} denote the coarse features extracted from F_r, F_{E,r}, F_{E→F,r}, and MF-Encoder(·) denotes the multi-mode encoding operation, which performs feature extraction on the three signal paths: for the two signal paths F_r and F_{E→F,r}, a three-layer convolution architecture with skip connections is used to extract features; for the F_{E,r} signal, three pulse layers with skip connections are used for feature extraction;
next, the features are enhanced using a cross-attention enhancement module:
f_{F,M}, f_{E,M}, f_{E→F,M} = CME(f_{F,0}, f_{E,0}, f_{E→F,0})
Wherein f_{F,0}, f_{E,0}, f_{E→F,0} are the three-way features output by the multi-mode encoding module, f_{F,M}, f_{E,M}, f_{E→F,M} are the three-way features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which enhances the features multiple times through M cross-attention Transformer modules; the process is expressed as:
f_{F,m}, f_{E,m}, f_{E→F,m} = C-Transformer(f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1}), m ∈ [1, M]
Wherein f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1} are the three-way features input to the m-th cross-attention Transformer module, f_{F,m}, f_{E,m}, f_{E→F,m} are the three-way features output, and C-Transformer(·) denotes the feature enhancement operation of the cross-attention Transformer: each signal path is first normalized and its attention information computed; the attention information of the three signal paths is added and fused to achieve cross-modal information enhancement; a multi-layer perceptron then performs feature enhancement and outputs the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m};
next, the density sensing module is used for carrying out weighted fusion on the three paths of characteristics:
f_{ALL} = DAF(f_{F,M}, f_{E,M}, f_{E→F,M}, F_r, F_{E,r}, F_{E→F,r})
Wherein f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced features output by the cross-attention enhancement module, F_r, F_{E,r}, F_{E→F,r} are the three original input signals, f_{ALL} is the fused output feature, and DAF(·) is the density-aware fusion operation: each input feature is concatenated with its corresponding original input along the feature dimension and fed into a single convolution layer for feature extraction; global average pooling and global max pooling then extract the mean and maximum of each of the three feature paths, which are fed into a fully connected layer to compute a weight; finally, the three signal paths are weighted and summed, outputting the fused feature f_{ALL};
Finally, decoding the features by using a multi-mode decoding module, and outputting a final brightness reconstruction image:
I_{recon} = MF-Decoder(f_{ALL})
Wherein f_{ALL} is the fused feature output by the density sensing module, I_{recon} is the reconstructed luminance image, and MF-Decoder(·) is the multi-mode feature decoding operation.
2. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 1, wherein: the multi-view event stream dataset E described in step 1 is represented as:
E = {e_k = (x_k, y_k, p_k, t_k)}, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, −1}
Wherein e_k is the k-th event point, x_k, y_k are the pixel coordinates of the k-th event point, p_k is the event polarity (a polarity of 1 represents an increase in light intensity, a polarity of −1 a decrease), t_k is the timestamp of the event point, W and H respectively denote the width and height of the event-point coordinate space, and K denotes the number of event points.
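For illustration, an event stream of this form maps naturally onto a NumPy structured array, which makes polarity filtering and time slicing one-liners. The field names and sample values below are illustrative, not from the patent:

```python
import numpy as np

# Structured dtype mirroring E = {e_k = (x_k, y_k, p_k, t_k)}.
event_dtype = np.dtype([("x", np.int32), ("y", np.int32),
                        ("p", np.int8), ("t", np.float64)])

# Three sample events (coordinates, polarity, timestamp in seconds).
events = np.array([(10, 20, 1, 0.001),
                   (11, 20, -1, 0.002),
                   (10, 21, 1, 0.004)], dtype=event_dtype)

on_events = events[events["p"] == 1]    # light intensity increased
off_events = events[events["p"] == -1]  # light intensity decreased
```

The same layout works for the refocused stream E_r: refocusing only rewrites the `x`/`y` fields while `p` and `t` are carried through unchanged.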
3. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 1, wherein: the multi-view image frame dataset F is represented as:
F = {I_n}, n ∈ [1, N], I_n ∈ R^{W×H}
Wherein I_n denotes the n-th image frame, W and H denote the width and height of an image frame, respectively, and N denotes the total number of image frames.
4. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 1, wherein: the method for mapping the multi-view event stream and image frames to the reference camera position in step 2 is specifically:
[x_r, y_r, 1]^T = K·R·K^{-1}·[x, y, 1]^T + K·T/d (1)
In formula (1), x_r, y_r denote the mapped image coordinates, R and T denote the rotation and translation matrices from the pixel's camera position to the reference camera position, x, y denote the original image coordinates, K denotes the intrinsic matrix of the camera, and d is the focusing distance;
From mapping formula (1), the refocused event stream dataset E_r can be obtained:
E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]
wherein e_k^r is the k-th event point after refocusing, x_k^r, y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (a polarity of 1 indicates an increase in light intensity, a polarity of −1 a decrease), and t_k^r is the timestamp of the event point; likewise, according to mapping formula (1), the refocused image frame dataset F_r can be obtained:
F_r = {I_n^r}, n ∈ [1, N]
wherein I_n^r denotes the n-th refocused image frame.
5. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 4, wherein: the refocused event stream framing procedure described in step 3 is expressed as:
F_{E,r}^j(x, y) = Σ_{t_k^r ∈ [t_1 + (j−1)·Δt, t_1 + j·Δt]} p_k^r·δ(x − x_k^r, y − y_k^r), j ∈ [1, J] (2)

In formula (2), F_{E,r}^j denotes the j-th event frame after frame compression, J is the total number of event frames, x and y denote image coordinates, x_k^r, y_k^r are the image coordinates of the k-th refocused event point, δ(·) denotes the Dirac delta function, and Δt denotes the time span used for a single compressed event frame, calculated as:
Δt = (t_K − t_1)/J
wherein t_K is the timestamp of the last (K-th) event point and t_1 is the timestamp of the first event point; thus, the refocused event frame dataset after frame compression can be obtained:
F_{E,r} = {F_{E,r}^j}, j ∈ [1, J]
6. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 5, wherein: the event stream pre-reconstruction and refocusing process in step 3 is expressed as:
F_{E→F}^j(x, y) = Recon(Σ_{t_k ∈ [t_1 + (j−1)·Δt, t_1 + j·Δt]} p_k·δ(x − x_k, y − y_k)), j ∈ [1, J] (3)

In formula (3), F_{E→F}^j denotes the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, x and y denote image coordinates, x_k, y_k are the image coordinates of the k-th event point, δ(·) is the Dirac delta function, Recon(·) denotes the event stream luminance reconstruction operator, and Δt denotes the time span used for a single compressed event frame; next, using mapping formula (1) described above, each image F_{E→F}^j is mapped to the reference camera position, obtaining the refocused pre-reconstructed event frame dataset F_{E→F,r}:
F_{E→F,r} = {F_{E→F,r}^j}, j ∈ [1, J]
wherein F_{E→F,r}^j is the j-th refocused pre-reconstructed event frame.
7. A synthetic aperture imaging method of a fused event camera and conventional optical camera as defined in claim 1, wherein: the learning loss function of the network in the step 3 is defined as:
L = β_per·L_per(I_recon, I_gt) + β_L1·L_L1(I_recon, I_gt) + β_tv·L_tv(I_recon) (4)

In formula (4), I_recon is the luminance image reconstructed by the hybrid neural network model, I_gt is the ground-truth image corresponding to the target image in the dataset, L_per is the perceptual loss, L_L1 is the L1-norm loss, L_tv is the total variation loss, and β_per, β_L1, β_tv denote the weights of the corresponding losses.
8. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 1, wherein: the convolution kernel sizes of the three convolution layers are 3, 5 and 7, and the numbers of convolution kernels are 8, 16 and 32, respectively; the kernel sizes of the three pulse layers are 1, 3 and 7, and the numbers of pulse kernels are 8, 16 and 32, respectively.
9. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 1, wherein: the Transformer module is an existing Swin-Transformer architecture, and the multi-mode feature decoder is an existing ResNet architecture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210422694.2A CN114862732B (en) | 2022-04-21 | 2022-04-21 | Synthetic aperture imaging method integrating event camera and traditional optical camera |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114862732A CN114862732A (en) | 2022-08-05 |
CN114862732B true CN114862732B (en) | 2024-04-26 |
Family
ID=82630677
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115578295B (en) * | 2022-11-17 | 2023-04-07 | 中国科学技术大学 | Video rain removing method, system, equipment and storage medium |
CN116310408B (en) * | 2022-11-29 | 2023-10-13 | 北京大学 | Method and device for establishing data association between event camera and frame camera |
CN115761472B (en) * | 2023-01-09 | 2023-05-23 | 吉林大学 | Underwater dim light scene reconstruction method based on fusion event and RGB data |
CN117939309A (en) * | 2024-03-25 | 2024-04-26 | 荣耀终端有限公司 | Image demosaicing method, electronic device and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111667442A (en) * | 2020-05-21 | 2020-09-15 | 武汉大学 | High-quality high-frame-rate image reconstruction method based on event camera |
CN112987026A (en) * | 2021-03-05 | 2021-06-18 | 武汉大学 | Event field synthetic aperture imaging algorithm based on hybrid neural network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11037278B2 (en) * | 2019-01-23 | 2021-06-15 | Inception Institute of Artificial Intelligence, Ltd. | Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures |
US11288818B2 (en) * | 2019-02-19 | 2022-03-29 | The Trustees Of The University Of Pennsylvania | Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning |
Non-Patent Citations (2)
Title |
---|
Event-based Synthetic Aperture Imaging with a Hybrid Network; Zhang Xiang et al.; IEEE; 2021-11-02; pp. 14235-14242 *
Synthetic aperture imaging method based on confocal illumination; Xiang Yiyi et al.; Acta Optica Sinica; 2020-04-30; Vol. 40, No. 08; pp. 73-79 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||