CN114862732A - Synthetic aperture imaging method fusing event camera and traditional optical camera - Google Patents

Synthetic aperture imaging method fusing event camera and traditional optical camera Download PDF

Info

Publication number
CN114862732A
CN114862732A (application CN202210422694.2A)
Authority
CN
China
Prior art keywords
event
image
frame
camera
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210422694.2A
Other languages
Chinese (zh)
Other versions
CN114862732B (en
Inventor
余磊
廖伟
张翔
王阳光
杨文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210422694.2A priority Critical patent/CN114862732B/en
Publication of CN114862732A publication Critical patent/CN114862732A/en
Application granted granted Critical
Publication of CN114862732B publication Critical patent/CN114862732B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 5/00: Image enhancement or restoration
            • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
          • G06T 2207/00: Indexing scheme for image analysis or image enhancement
            • G06T 2207/20: Special algorithmic details
              • G06T 2207/20081: Training; Learning
              • G06T 2207/20084: Artificial neural networks [ANN]
              • G06T 2207/20212: Image combination
                • G06T 2207/20221: Image fusion; Image merging
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/045: Combinations of networks
              • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a synthetic aperture imaging method that fuses an event camera with a traditional optical camera, combining the advantages of the two cameras for synthetic aperture imaging. By building a neural network architecture based on a spiking neural network and a convolutional neural network, the method constructs a bridge between the event stream and the image frames, reconstructs a high-quality occlusion-free image of the target, and accomplishes high-quality see-through imaging in occlusion scenes of various densities. By jointly exploiting the strengths of the event camera and the traditional optical camera together with the strong learning capability of neural networks, the invention achieves high-quality image reconstruction of occluded targets in occlusion scenes of various densities, thereby improving the robustness and applicability of synthetic aperture imaging.

Description

Synthetic aperture imaging method fusing event camera and traditional optical camera
Technical Field
The invention belongs to the field of image processing, and particularly relates to a synthetic aperture imaging method for fusing an event camera and a traditional optical camera.
Background
Synthetic aperture imaging (SAI) uses a camera to observe a scene from multiple viewpoints, which is equivalent to imaging with a virtual camera of very large aperture. The larger the aperture, the smaller the depth of field; therefore, when the target being photographed is occluded, synthetic aperture imaging can image the occluded target by blurring out the occluder. The technique consequently has very high application value in three-dimensional reconstruction, target tracking, recognition and related fields.
Current synthetic aperture imaging algorithms mainly perform see-through imaging with multi-view image-frame sequences captured by traditional optical cameras. However, as the occlusion becomes denser, the amount of target light information contained in the image frames drops sharply while interfering light from the occluder becomes dominant, so the performance of frame-based synthetic aperture imaging degrades severely. In recent years, synthetic aperture imaging algorithms based on event cameras have been proposed to address see-through imaging in densely occluded scenes. An event camera senses per-pixel brightness changes in the logarithmic domain and asynchronously outputs an event stream with high temporal resolution and high dynamic range, so it can sense the target almost continuously and still gather sufficient target information in densely occluded scenes. However, because existing event-camera synthetic aperture imaging algorithms form images from event points generated by the brightness difference between the target and the occluder, their performance degrades in sparsely occluded scenes where the number of effective event points decreases. Since frame-based and event-based synthetic aperture imaging perform well in sparsely and densely occluded scenes respectively, the information from both can be fully exploited for fused imaging, enabling synthetic aperture imaging under occlusions of various densities. However, because the data modality of an event stream is completely different from that of image frames, establishing a bridge between the two for fused imaging remains difficult.
Disclosure of Invention
In view of the above problems, the invention provides a synthetic aperture imaging method based on an event camera and a traditional optical camera. The method combines the advantages of the two cameras for synthetic aperture imaging: by building a neural network architecture based on a spiking neural network and a convolutional neural network, it constructs a bridge between the event stream and the image frames and reconstructs a high-quality occlusion-free image of the target, accomplishing high-quality see-through imaging in occlusion scenes of various densities.
The invention provides a synthetic aperture imaging method based on an event camera and a traditional optical camera, which comprises the following specific steps:
step 1, constructing a multi-view event stream and an image frame data set in an occluded scene, and constructing an image frame data set in an unoccluded scene.
Step 2, selecting a reference camera position, mapping the multi-view event stream and the image frame to the reference camera position according to a multi-view geometric principle, and refocusing the shielded target to obtain a refocused event stream and image frame data set;
Step 3, constructing a hybrid neural network model, compressing the refocused event stream into event frames, performing pre-reconstruction on the unfocused event stream to obtain pre-reconstructed event frames and then refocusing them, inputting the refocused event frames, the refocused pre-reconstructed event frames and the refocused image frame data set into the hybrid neural network as the training set to obtain the target reconstruction image predicted by the network, and, combining the occlusion-free image of the target with the network learning loss function, iteratively optimizing the network parameters with the Adam optimizer to obtain the trained hybrid neural network;
The hybrid neural network model in step 3 comprises the following modules: a multi-modal coding module, a cross-attention enhancement module, a density perception module and a multi-modal decoding module;
the multi-modal coding module comprises several convolutional layers or spiking layers and is used for feature extraction;
the cross-attention enhancement module comprises several cross-attention Transformer modules and is used for enhancing the extracted features multiple times;
the density perception module concatenates the enhanced features with the original features, then performs a weighted summation of the features through a convolutional layer, a global average pooling layer, a global max pooling layer and a fully connected layer, and outputs the fused features;
the multi-modal decoding module is an existing convolutional neural network architecture and is used for reconstructing the image from the fused features;
Step 4, inputting the event stream and image frames of the occluded target to be reconstructed into the trained network for prediction to obtain the de-occluded reconstructed image corresponding to the occluded target.
Further, the multi-view event stream data set E in step 1 is:
E = { e_k = (x_k, y_k, p_k, t_k) }, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, -1}
where e_k is the k-th event point, x_k, y_k are the pixel coordinates of the k-th event point, p_k is the event polarity (polarity 1 indicates a brightness increase, polarity -1 a brightness decrease), t_k is the timestamp of the event point, W and H are the width and height of the event-point spatial coordinates, and K is the number of event points.
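For concreteness, the following minimal sketch shows one way such an event stream could be held in memory as a NumPy structured array; the field names and dtypes are illustrative choices, not prescribed by the invention.

```python
import numpy as np

# Illustrative container for an event stream E = {e_k = (x_k, y_k, p_k, t_k)}.
event_dtype = np.dtype([
    ("x", np.uint16),   # pixel column, x_k in [1, W]
    ("y", np.uint16),   # pixel row,    y_k in [1, H]
    ("p", np.int8),     # polarity p_k: +1 brightness increase, -1 decrease
    ("t", np.float64),  # timestamp t_k in seconds
])

def make_event_stream(xs, ys, ps, ts):
    """Pack parallel arrays into a single structured event array, sorted by time."""
    events = np.empty(len(xs), dtype=event_dtype)
    events["x"], events["y"], events["p"], events["t"] = xs, ys, ps, ts
    return events[np.argsort(events["t"])]
```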
Further, the multi-view image frame data set F in step 1 is:
F = { I_n ∈ R^(W×H) }, n ∈ [1, N]
where I_n is the n-th image frame, W and H are the width and height of the image frames, and N is the total number of image frames.
Further, the mapping of the multi-view event stream and image frames to the reference camera position in step 2 is specifically as follows:
[x_r, y_r, 1]^T = K R K^(-1) [x, y, 1]^T + K T / d    (1)
In formula (1), x_r, y_r are the mapped image coordinates, x and y are the original image coordinates, R and T are the rotation and translation matrices from the pixel's viewpoint to the reference camera position, K is the camera intrinsic matrix, and d is the refocusing distance, generally set to the distance of the occluded target from the camera plane. From this mapping formula, the refocused event stream data set E_r can be obtained:
E_r = { e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r) }, k ∈ [1, K]
where e_k^r is the k-th event point after refocusing, x_k^r and y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (polarity 1 indicates a brightness increase, polarity -1 a brightness decrease), and t_k^r is the timestamp of the event point. From the mapping formula (1), the refocused image frame data set F_r can likewise be obtained:
F_r = { I_n^r ∈ R^(W×H) }, n ∈ [1, N]
where I_n^r is the n-th refocused image frame.
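The sketch below illustrates the refocusing warp of formula (1) applied to a batch of pixel coordinates; the calibration quantities K, R, T and the refocusing distance d are assumed to be known, and the homogeneous normalization and any sub-pixel rounding are implementation choices not fixed by the text.

```python
import numpy as np

def refocus_points(x, y, K, R, T, d):
    """Map pixel coordinates to the reference view via formula (1):
    [x_r, y_r, 1]^T = K R K^(-1) [x, y, 1]^T + K T / d.
    x, y: 1-D coordinate arrays; K: 3x3 intrinsics; R: 3x3 rotation;
    T: length-3 translation to the reference camera; d: refocusing depth."""
    pts = np.stack([x, y, np.ones_like(x, dtype=np.float64)])          # 3 x N homogeneous
    warped = K @ R @ np.linalg.inv(K) @ pts + (K @ np.asarray(T).reshape(3, 1)) / d
    warped /= warped[2:3]          # normalize; a no-op when the third component is already 1
    return warped[0], warped[1]    # x_r, y_r (sub-pixel; round or splat as needed)
```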
Further, the frame compression of the refocused event stream described in step 3 is expressed as:
I_j^{E,r}(x, y) = Σ_{t_k^r ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k^r · δ(x - x_k^r, y - y_k^r),  j ∈ [1, J]    (2)
In formula (2), I_j^{E,r} is the event frame after the j-th frame compression, J is the total number of event frames, x and y are image coordinates, x_k^r, y_k^r are the image coordinates of the k-th refocused event point, δ(·) is the Dirichlet function, and Δt is the time length used for compressing a single event frame, calculated as:
Δt = (t_K - t_1) / J
where t_K is the timestamp of the last (K-th) event point and t_1 is the timestamp of the first event point. The frame-compressed refocused event frame data set F_{E,r} is thus obtained:
F_{E,r} = { I_j^{E,r} }, j ∈ [1, J]
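A minimal sketch of this frame-compression step is given below: refocused events are binned into J equal time windows and their signed polarities are accumulated per pixel. The signed accumulation and the nearest-pixel rounding are assumptions of the sketch.

```python
import numpy as np

def events_to_frames(events, W, H, J):
    """Accumulate a refocused event stream into J event frames (in the spirit of formula (2)).
    events: structured array with fields x, y, p, t, sorted by timestamp.
    Returns a (J, H, W) float array of signed polarity counts."""
    t_first, t_last = events["t"][0], events["t"][-1]
    dt = (t_last - t_first) / J                                    # time span of one event frame
    frames = np.zeros((J, H, W), dtype=np.float32)
    j = np.minimum(((events["t"] - t_first) / dt).astype(int), J - 1)   # bin index per event
    xs = np.clip(np.round(events["x"].astype(np.float64)).astype(int), 0, W - 1)
    ys = np.clip(np.round(events["y"].astype(np.float64)).astype(int), 0, H - 1)
    np.add.at(frames, (j, ys, xs), events["p"].astype(np.float32))      # signed accumulation
    return frames
```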
Further, the event stream pre-reconstruction described in step 3 is expressed as:
I_j^{E→F}(x, y) = Recon( Σ_{t_k ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k · δ(x - x_k, y - y_k) ),  j ∈ [1, J]    (3)
In formula (3), I_j^{E→F} is the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, x and y are image coordinates, x_k, y_k are the image coordinates of the k-th event point, δ(·) is the Dirichlet function, Recon(·) is an event-stream brightness reconstruction operator (a mainstream event-stream reconstruction algorithm is generally used), and Δt is the time length used for compressing a single event frame. Next, each image I_j^{E→F} is mapped to the reference camera position using the mapping formula (1) above, yielding the refocused pre-reconstructed event frame data set F_{E→F,r}:
F_{E→F,r} = { I_j^{E→F,r} }, j ∈ [1, J]
where I_j^{E→F,r} is the j-th refocused pre-reconstructed event frame.
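The sketch below shows how the pre-reconstruction step might be wrapped around a generic brightness-reconstruction operator; recon_operator is a placeholder callable standing in for Recon(·) (e.g. an E2VID-style network) and is not a specific library API. Refocusing of the resulting frames with formula (1) would follow as a separate warping step.

```python
import numpy as np

def prereconstruct_frames(events, W, H, J, recon_operator):
    """Split the raw (unfocused) event stream into J time windows and hand each
    window to a brightness-reconstruction operator Recon(.) (formula (3)).
    recon_operator: callable mapping (event chunk, W, H) -> (H, W) intensity image."""
    t_first, t_last = events["t"][0], events["t"][-1]
    dt = (t_last - t_first) / J
    out = np.zeros((J, H, W), dtype=np.float32)
    for j in range(J):
        lo, hi = t_first + j * dt, t_first + (j + 1) * dt
        mask = (events["t"] >= lo) & (events["t"] < hi) if j < J - 1 else (events["t"] >= lo)
        out[j] = recon_operator(events[mask], W, H)      # j-th pre-reconstructed event frame
    return out
```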
Further, the hybrid neural network model in step 3 comprises the following modules: a multi-modal coding module, a cross-attention enhancement module, a density perception module and a multi-modal decoding module. The network takes as input the frame-compressed refocused event frames F_{E,r}, the refocused image frames F_r and the refocused pre-reconstructed event frames F_{E→F,r}. The three branches are first fed into the multi-modal coding module for coarse feature extraction:
f_{F,0}, f_{E,0}, f_{E→F,0} = MF-Encoder(F_r, F_{E,r}, F_{E→F,r})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the coarse features extracted from F_r, F_{E,r}, F_{E→F,r} respectively, and MF-Encoder(·) is the multi-modal coding operation, which extracts features from each of the three signals. For the F_r and F_{E→F,r} branches, features are extracted with a three-layer convolutional structure containing skip connections. For the F_{E,r} branch, features are extracted with three spiking layers containing skip connections.
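A minimal PyTorch sketch of this three-branch coarse encoder is given below. The channel widths, kernel sizes and the 1x1-convolution form of the skip connections are illustrative assumptions, and the spiking branch is stubbed with an ordinary convolutional branch because a true spiking layer (e.g. from an SNN library) is outside the scope of the sketch.

```python
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """Three conv layers with skip connections, as in the image-frame and
    pre-reconstructed-event-frame branches of MF-Encoder (a sketch)."""
    def __init__(self, in_ch, chs=(8, 16, 32), ks=(3, 5, 7)):
        super().__init__()
        self.blocks = nn.ModuleList()
        c_prev = in_ch
        for c, k in zip(chs, ks):
            self.blocks.append(nn.Sequential(
                nn.Conv2d(c_prev, c, k, padding=k // 2), nn.ReLU(inplace=True)))
            c_prev = c
        # 1x1 convs realize the skip connections when channel counts differ
        self.skips = nn.ModuleList(
            [nn.Conv2d(cin, cout, 1) for cin, cout in zip((in_ch,) + chs[:-1], chs)])

    def forward(self, x):
        for block, skip in zip(self.blocks, self.skips):
            x = block(x) + skip(x)      # skip (jump) connection
        return x

class MFEncoder(nn.Module):
    """Coarse feature extraction for the three inputs F_r, F_{E,r}, F_{E->F,r}.
    The spiking branch is stubbed with a ConvBranch; a real implementation would
    use spiking layers (e.g. LIF neurons) as the invention describes."""
    def __init__(self, ch_frame=1, ch_event=1, ch_pre=1):
        super().__init__()
        self.frame_branch = ConvBranch(ch_frame)
        self.event_branch = ConvBranch(ch_event)   # placeholder for the SNN branch
        self.pre_branch = ConvBranch(ch_pre)

    def forward(self, F_r, F_Er, F_EFr):
        return self.frame_branch(F_r), self.event_branch(F_Er), self.pre_branch(F_EFr)
```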
Next, the features are enhanced with the cross-attention enhancement module:
f_{F,M}, f_{E,M}, f_{E→F,M} = CME(f_{F,0}, f_{E,0}, f_{E→F,0})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the three-branch features output by the multi-modal coding module in the previous step, f_{F,M}, f_{E,M}, f_{E→F,M} are the three-branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, in which M cross-attention Transformer modules enhance the features repeatedly. The process is expressed as:
f_{F,m}, f_{E,m}, f_{E→F,m} = C-Transformer(f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1}), m ∈ [1, M]
where f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1} are the three-branch features input to the m-th cross-attention Transformer module and f_{F,m}, f_{E,m}, f_{E→F,m} are the output three-branch features. C-Transformer(·) denotes the feature enhancement operation of a cross-attention Transformer: each signal is first normalized and its own attention information is computed; cross-modal information enhancement is achieved by adding and fusing the attention information of the three branches; features are then enhanced with a multilayer perceptron, and the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m} are output.
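The following simplified PyTorch sketch captures one C-Transformer step: each branch is normalized, per-branch attention is computed, the three attention outputs are summed for cross-modal fusion, and an MLP refines each branch. It flattens the spatial dimensions into tokens and uses standard multi-head attention rather than the Swin-style windowed attention of the embodiment, so it is an approximation.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One C-Transformer step (sketch): per-branch attention on normalized inputs,
    additive fusion of the three attention maps, then an MLP refinement per branch."""
    def __init__(self, dim, heads=4):          # dim must be divisible by heads
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(3)])
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)])
        self.mlps = nn.ModuleList([nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * 2), nn.GELU(),
            nn.Linear(dim * 2, dim)) for _ in range(3)])

    def forward(self, feats):                  # feats: list of 3 tensors (B, N, dim)
        attn_outs = []
        for x, norm, attn in zip(feats, self.norms, self.attns):
            h = norm(x)
            attn_outs.append(attn(h, h, h, need_weights=False)[0])
        fused = sum(attn_outs)                 # cross-modal additive fusion
        return [x + fused + mlp(x + fused) for x, mlp in zip(feats, self.mlps)]
```

In use, each (B, C, H, W) feature map would be flattened to (B, H*W, C) tokens before entering the block and reshaped back afterwards.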
Next, the three-branch features are fused with weights by the density perception module:
f_ALL = DAF(f_{F,M}, f_{E,M}, f_{E→F,M}, F_r, F_{E,r}, F_{E→F,r})
where f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced features output in the previous step, F_r, F_{E,r}, F_{E→F,r} are the three original signals input to the network, f_ALL is the fused output feature, and DAF(·) is the density perception fusion operation: each input feature is first concatenated with its corresponding original input along the feature dimension and fed into a single convolutional layer for feature extraction; then global average pooling and global max pooling extract the mean and maximum of each of the three branch features, which are fed into a fully connected layer to compute the weights; finally, the three branch signals are summed with these weights and the fused feature f_ALL is output.
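A minimal PyTorch sketch of this density perception fusion is given below; the softmax normalization of the branch weights and the channel widths are assumptions of the sketch rather than details fixed by the text.

```python
import torch
import torch.nn as nn

class DensityAwareFusion(nn.Module):
    """Sketch of DAF: concatenate each enhanced feature with its raw input,
    extract features with a single conv layer, pool (average + max) to a
    descriptor, predict per-branch weights with a fully connected layer,
    and return the weighted sum of the three branches."""
    def __init__(self, feat_ch, raw_ch, mid_ch=32):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(feat_ch + raw_ch, mid_ch, 3, padding=1) for _ in range(3)])
        self.fc = nn.Linear(3 * 2 * mid_ch, 3)   # avg+max descriptors of 3 branches -> 3 weights

    def forward(self, feats, raws):              # lists of 3 tensors, (B, C, H, W) each
        branch_feats, descriptors = [], []
        for conv, f, r in zip(self.convs, feats, raws):
            h = conv(torch.cat([f, r], dim=1))   # feature concatenation + single conv layer
            branch_feats.append(h)
            gap = h.mean(dim=(2, 3))             # global average pooling
            gmp = h.amax(dim=(2, 3))             # global max pooling
            descriptors.append(torch.cat([gap, gmp], dim=1))
        w = torch.softmax(self.fc(torch.cat(descriptors, dim=1)), dim=1)   # (B, 3) weights
        return sum(w[:, i, None, None, None] * branch_feats[i] for i in range(3))
```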
Finally, the features are decoded with the multi-modal decoding module and the final brightness reconstruction image is output:
I_recon = MF-Decode(f_ALL)
where f_ALL is the fused feature output by the previous module, I_recon is the reconstructed brightness image, and MF-Decode(·) is the multi-modal feature decoding operation, which generally uses a mainstream convolutional neural network architecture.
Further, the network learning loss function L in step 3 is defined as:
L = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)    (4)
In formula (4), I_recon is the brightness image reconstructed by the network and I_gt is the ground-truth image corresponding to the target image in the data set; L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss, all of which are commonly used known functions; β_per, β_L1, β_tv are the weights of the corresponding losses.
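A possible PyTorch sketch of this loss is shown below; the use of VGG16 features for the perceptual term, the particular feature layer, and the omission of input normalization are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F
import torchvision

class ReconstructionLoss(torch.nn.Module):
    """Sketch of formula (4): L = b_per*L_per + b_L1*L_L1 + b_tv*L_tv."""
    def __init__(self, b_per=1.0, b_l1=32.0, b_tv=2e-4):
        super().__init__()
        # Frozen VGG16 features as the perceptual-loss backbone (an assumed choice).
        vgg = torchvision.models.vgg16(
            weights=torchvision.models.VGG16_Weights.DEFAULT).features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.b_per, self.b_l1, self.b_tv = vgg, b_per, b_l1, b_tv

    def forward(self, recon, gt):               # (B, 1, H, W) images in [0, 1]
        r3, g3 = recon.repeat(1, 3, 1, 1), gt.repeat(1, 3, 1, 1)   # VGG expects 3 channels
        loss_per = F.l1_loss(self.vgg(r3), self.vgg(g3))           # perceptual loss
        loss_l1 = F.l1_loss(recon, gt)                             # L1-norm loss
        loss_tv = (recon[..., 1:, :] - recon[..., :-1, :]).abs().mean() + \
                  (recon[..., :, 1:] - recon[..., :, :-1]).abs().mean()   # total variation
        return self.b_per * loss_per + self.b_l1 * loss_l1 + self.b_tv * loss_tv
```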
Further, the event stream and image frame data input in step 4 need to be subjected to the same refocusing process as that in step 2, then subjected to the same event stream preprocessing process as that in step 3, and then input into the trained neural network to obtain a corresponding target reconstruction image.
Compared with the prior art, the invention has the advantages that:
the invention provides a synthetic aperture imaging method fusing an event camera and a traditional optical camera, which comprehensively utilizes the advantages of the event camera and the traditional optical camera and uses the strong learning capability of a neural network, thereby realizing the high-quality image reconstruction of a target in various dense occlusion scenes and further enhancing the robustness and the applicability of the synthetic aperture imaging technology.
Drawings
FIG. 1 is a schematic view of an experimental scene including a shielded target, a shield, a programmable slide rail, and a camera sensor.
FIG. 2 is an overall flow chart of the method of the present invention.
Fig. 3 is a schematic diagram of a process of acquiring data by moving a camera.
Fig. 4 is a comparison graph of image frames, event frames and pre-reconstruction event frames after data preprocessing.
Fig. 5 is a schematic diagram of a hybrid neural network structure.
Fig. 6 is a comparison of the present method with different synthetic aperture imaging methods. The first to four rows from top to bottom are the synthetic aperture imaging results of different dense occlusion scenes. From left to right, the first column is a schematic diagram of an occlusion, the second column is an unobstructed reference image, the third column is a synthetic aperture imaging algorithm (F-SAI + ACC) based on a traditional optical camera, the fourth column is a synthetic aperture imaging algorithm (F-SAI + CNN) based on a traditional optical camera and a convolutional neural network, the fifth column is a synthetic aperture imaging algorithm (E-SAI + ACC) based on an event camera and an accumulation method, the sixth column is a synthetic aperture imaging algorithm (E-SAI + Hybrid) based on an event camera and a Hybrid neural network, and the seventh column is an inventive algorithm.
Detailed Description
For a clear understanding of the present invention, the technical content of the invention is described more clearly and completely below with reference to FIG. 1 and an example. Obviously, the described example is only a part of the examples of the present invention, not all of them. All other examples obtained by those skilled in the art based on the examples in the present invention without creative effort fall within the protection scope of the present invention.
The invention relates to a synthetic aperture imaging method for fusing an event camera and a traditional optical camera.
The schematic scenario of the specific implementation of the present invention is shown in fig. 1, and includes: the system comprises a camera sensor, a programmable slide rail, a shelter and a sheltered target;
the camera sensor is a Davis346 event camera, and the camera can synchronously output event streams and image frame data and is used for constructing a corresponding data set;
the Davis346 event camera is mounted on a programmable slide rail that is set to move linearly at a constant speed;
the overall flow chart of the invention is shown in the attached figure 2, and the specific steps are as follows:
step 1, constructing a multi-view event stream and an image frame data set in an occluded scene, and constructing an image frame data set in an unoccluded scene, as shown in fig. 3;
the multi-view event stream data E in step 1 is
E={e (k}) =(x k ,y k ,p k ,t k )},k∈[1,K],x k ∈[1,W],y k ∈[1,H],p k ∈[1,-1]
Wherein e is k For the kth event point data, x k ,y k Pixel coordinates of a kth event point; p is a radical of k Event polarity (polarity 1 represents increased brightness and polarity-1 represents decreased brightness); t is t k A timestamp of the event point; w346, H260 each indicate the width and height of the event point space coordinate, and K indicates the number of event points.
The multi-view image frame data F in step 1 is:
F = { I_n ∈ R^(W×H) }, n ∈ [1, N]
where I_n is the n-th image frame, W = 346 and H = 260 are the width and height of the image frames, and N = 30 is the total number of image frames.
Step 2, selecting a reference camera position, mapping the multi-view event stream and the image frames to the reference camera position according to the multi-view geometry principle, and refocusing on the occluded target to obtain the refocused event stream and image frame data sets;
the manner of mapping the multi-view event stream image frame to the reference camera position described in step 2 is specifically as follows:
[x r ,y r ,1]=KRK -1 [x,y,1]+KT/d (1)
in the formula (1), x r ,y r And (3) representing the mapped image coordinates, R and T represent a rotation matrix and a translation matrix from the pixel point to the position of a reference camera, K represents an internal reference matrix of the camera, and d is a focusing distance which is generally set as the distance from a shielded target to a camera plane. In this example, since the set camera motion mode is a uniform linear motion, it can be considered that the clock does not rotate during the shooting process of the camera, and thus the rotation matrix is a diagonal unit matrix. At time t, the translation matrix calculation mode can be modeled as
T t =[v track (t-t r )0 0]
Wherein v is track 0.0885m/s is the moving speed of the guide rail, t r For each shot, the camera is at a time stamp of the reference position.
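Under this uniform-linear-motion model, with R equal to the identity, the warp of formula (1) reduces to a pure pixel shift K·T_t/d, as the short sketch below illustrates; the intrinsic matrix and refocusing distance used here are placeholder values, not the calibrated Davis346 parameters.

```python
import numpy as np

def translation_at(t, t_ref, v_track=0.0885):
    """Translation (metres) from the camera pose at time t to the reference pose,
    for a camera moving at constant speed v_track along the rail's x-axis."""
    return np.array([v_track * (t - t_ref), 0.0, 0.0])

def refocus_shift(K, T, d):
    """Pixel shift contributed by the K*T/d term of formula (1); with R = I
    (no rotation on the rail) the warp reduces to adding this shift."""
    return (K @ T) / d

# Illustrative values only (not the calibrated Davis346 parameters):
K = np.array([[320.0, 0.0, 173.0], [0.0, 320.0, 130.0], [0.0, 0.0, 1.0]])
shift = refocus_shift(K, translation_at(t=0.3, t_ref=0.0), d=1.5)   # [dx, dy, 0]
```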
From the mapping formula (1), the refocused event stream data E_r can be obtained:
E_r = { e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r) }, k ∈ [1, K]
where e_k^r is the k-th event point after refocusing, x_k^r and y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (polarity 1 indicates a brightness increase, polarity -1 a brightness decrease), and t_k^r is the timestamp of the event point. From the mapping formula (1), the refocused image frame data set F_r can likewise be obtained:
F_r = { I_n^r ∈ R^(W×H) }, n ∈ [1, N]
where I_n^r is the n-th refocused image frame.
Step 3, constructing a hybrid neural network model, compressing the refocused event stream into event frames, performing pre-reconstruction on the unfocused event stream to obtain pre-reconstructed event frames and then refocusing them, inputting the refocused event frames, the refocused pre-reconstructed event frames and the refocused image frame data set into the hybrid neural network as the training set to obtain the network-predicted target reconstruction image, and, combining the occlusion-free image of the target with the network learning loss function, iteratively optimizing the network parameters with the Adam optimizer to obtain the trained hybrid neural network;
the refocusing event streaming frame process described in step 3 is represented as:
Figure BDA0003607164140000078
in the formula (2), the first and second groups,
Figure BDA0003607164140000079
the data of the event frame after the jth compression frame is shown, J is the total number of the event frames, x and y represent image coordinates,
Figure BDA00036071641400000710
for the image coordinates of the k-th refocusing event point, δ (·) is expressed as a Dirichlet function, Δ t tableThe time length adopted by the frame pressing of a single event frame is shown, and the calculation mode is shown as follows:
Figure BDA00036071641400000711
wherein, t k Is the time stamp of the kth event point, t 1 Is the timestamp of the first event point. Therefore, a refocusing event frame data set F after frame pressing can be obtained E,r
Figure BDA0003607164140000081
The event stream pre-reconstruction and refocusing described in step 3 is expressed as:
I_j^{E→F}(x, y) = Recon( Σ_{t_k ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k · δ(x - x_k, y - y_k) ),  j ∈ [1, J]    (3)
In formula (3), I_j^{E→F} is the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, x and y are image coordinates, x_k, y_k are the image coordinates of the k-th event point, δ(·) is the Dirichlet function, Recon(·) is the event-stream brightness reconstruction operator (the existing E2VID algorithm is used in this example), and Δt is the time length used for compressing a single event frame. Next, each image I_j^{E→F} is mapped to the reference camera position using the mapping formula (1) above, yielding the refocused pre-reconstructed event frame data set:
F_{E→F,r} = { I_j^{E→F,r} }, j ∈ [1, J]
where I_j^{E→F,r} is the j-th refocused pre-reconstructed event frame.
Figure 4 shows visually the refocused image frame, the event frame and the pre-reconstructed event frame, respectively.
The structure of the hybrid neural network described in step 3 is shown in FIG. 5. The hybrid neural network model in step 3 comprises the following modules: a multi-modal coding module, a cross-attention enhancement module, a density perception module and a multi-modal decoding module. The network takes as input the frame-compressed refocused event frames F_{E,r}, the refocused image frames F_r and the refocused pre-reconstructed event frames F_{E→F,r}. The three branches are first fed into the multi-modal coding module for coarse feature extraction:
f_{F,0}, f_{E,0}, f_{E→F,0} = MF-Encoder(F_r, F_{E,r}, F_{E→F,r})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the coarse features extracted from F_r, F_{E,r}, F_{E→F,r} respectively, and MF-Encoder(·) is the multi-modal coding operation, which extracts features from each of the three signals. For the F_r and F_{E→F,r} branches, features are extracted with a three-layer convolutional structure containing skip connections; in this embodiment the convolution kernel sizes of the three convolutional layers are 3, 5 and 7, and the numbers of kernels are 8, 16 and 32. For the F_{E,r} branch, features are extracted with three spiking layers containing skip connections; in this embodiment the spiking kernel sizes of the three spiking layers are 1, 3 and 7, and the numbers of kernels are 8, 16 and 32.
Next, the features are enhanced with the cross-attention enhancement module:
f_{F,M}, f_{E,M}, f_{E→F,M} = CME(f_{F,0}, f_{E,0}, f_{E→F,0})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the three-branch features output by the multi-modal coding module in the previous step, f_{F,M}, f_{E,M}, f_{E→F,M} are the three-branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which contains M = 6 cross-attention Transformer modules that enhance the features repeatedly:
f_{F,m}, f_{E,m}, f_{E→F,m} = C-Transformer(f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1}), m ∈ [1, M]
where f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1} are the three-branch features input to the m-th cross-attention Transformer module and f_{F,m}, f_{E,m}, f_{E→F,m} are the output three-branch features. C-Transformer(·) denotes the feature enhancement operation of a cross-attention Transformer: each signal is first normalized and its own attention information is computed; cross-modal information enhancement is achieved by adding and fusing the attention information of the three branches; features are then enhanced with a multilayer perceptron, and the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m} are output. In this embodiment, the Transformer module used is the existing Swin-Transformer architecture.
Next, the three-branch features are fused with weights by the density perception module:
f_ALL = DAF(f_{F,M}, f_{E,M}, f_{E→F,M}, F_r, F_{E,r}, F_{E→F,r})
where f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced features output in the previous step, F_r, F_{E,r}, F_{E→F,r} are the three original signals input to the network, f_ALL is the fused output feature, and DAF(·) is the density perception fusion operation: each input feature is first concatenated with its corresponding original input along the feature dimension and fed into a single convolutional layer for feature extraction (in this embodiment the convolution kernel size is 3 and the number of kernels is 32); then global average pooling and global max pooling extract the mean and maximum of each of the three branch features, which are fed into a fully connected layer to compute the weights; finally, the three branch signals are summed with these weights and the fused feature f_ALL is output.
Finally, the features are decoded with the multi-modal decoding module and the final brightness reconstruction image is output:
I_recon = MF-Decode(f_ALL)
where f_ALL is the fused feature output by the previous module, I_recon is the reconstructed brightness image, and MF-Decode(·) is the multi-modal feature decoding operation; in this embodiment the existing ResNet architecture is used as the multi-modal feature decoder.
The network learning loss function in step 3 is defined as:
L = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)    (4)
In formula (4), I_recon is the brightness image reconstructed by the network and I_gt is the ground-truth image corresponding to the target image in the data set; L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss, all of which are commonly used known functions; β_per, β_L1, β_tv are the weights of the corresponding losses, set in this example to β_per = 1, β_L1 = 32, β_tv = 0.0002.
Step 4, inputting the event stream and image frames of the occluded target to be reconstructed into the trained network for prediction to obtain the de-occluded reconstructed image corresponding to the occluded target.
The event stream and image frame data input in the step 4 need to be subjected to the same refocusing process as that in the step 2, then subjected to the same event stream preprocessing process as that in the step 3, and then input into the trained neural network to obtain a corresponding target reconstruction image.
Figure 6 compares the results of the algorithm of the present invention with other synthetic aperture imaging algorithms in a variety of occluded scenes. The first to four rows from top to bottom are the synthetic aperture imaging results of different dense occlusion scenes. From left to right, the first column is a schematic diagram of an occlusion, the second column is an unobstructed reference image, the third column is a synthetic aperture imaging algorithm (F-SAI + ACC) based on a traditional optical camera, the fourth column is a synthetic aperture imaging algorithm (F-SAI + CNN) based on a traditional optical camera and a convolutional neural network, the fifth column is a synthetic aperture imaging algorithm (E-SAI + ACC) based on an event camera and an accumulation method, the sixth column is a synthetic aperture imaging algorithm (E-SAI + Hybrid) based on an event camera and a Hybrid neural network, and the seventh column is an inventive algorithm.
It should be understood that technical parts not described in detail in the specification belong to the prior art.
It should be understood that the above-described preferred embodiments are illustrative and not restrictive of the scope of the invention, and that various changes, substitutions and alterations can be made herein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for synthetic aperture imaging combining an event camera with a conventional optical camera, comprising the steps of:
step 1, constructing a multi-view event stream and an image frame data set in an occluded scene, and constructing an image frame data set in an unoccluded scene;
step 2, selecting a reference camera position, mapping the multi-view event stream and the image frames to the reference camera position according to the multi-view geometry principle, and refocusing on the occluded target to obtain the refocused event stream and image frame data sets;
step 3, constructing a hybrid neural network model, compressing the refocused event stream into event frames, performing pre-reconstruction on the unfocused event stream to obtain pre-reconstructed event frames and then refocusing them, inputting the refocused event frames, the refocused pre-reconstructed event frames and the refocused image frame data set into the hybrid neural network model as the training set to obtain the network-predicted target reconstruction image, and, combining the occlusion-free image of the target with the network learning loss function, iteratively optimizing the network parameters with the Adam optimizer to obtain the trained hybrid neural network model;
the hybrid neural network model in step 3 comprises the following modules: a multi-modal coding module, a cross-attention enhancement module, a density perception module and a multi-modal decoding module;
the multi-modal coding module comprises several convolutional layers or spiking layers and is used for feature extraction;
the cross-attention enhancement module comprises several cross-attention Transformer modules and is used for enhancing the extracted features multiple times;
the density perception module concatenates the enhanced features with the original features, then performs a weighted summation of the features through a convolutional layer, a global average pooling layer, a global max pooling layer and a fully connected layer, and outputs the fused features;
the multi-modal decoding module is an existing convolutional neural network architecture and is used for reconstructing the image from the fused features;
step 4, inputting the event stream and image frames of the occluded target to be reconstructed into the trained hybrid neural network model for prediction to obtain the de-occluded reconstructed image corresponding to the occluded target.
2. The method according to claim 1, wherein the multi-view event stream data set E described in step 1 is expressed as:
E = { e_k = (x_k, y_k, p_k, t_k) }, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, -1}
where e_k is the k-th event point, x_k, y_k are the pixel coordinates of the k-th event point, p_k is the event polarity, a polarity of 1 indicating a brightness increase and a polarity of -1 a brightness decrease, t_k is the timestamp of the event point, W and H are the width and height of the event-point spatial coordinates, and K is the number of event points.
3. The method according to claim 1, wherein the multi-view image frame data set F is expressed as:
F = { I_n ∈ R^(W×H) }, n ∈ [1, N]
where I_n is the n-th image frame, W and H are the width and height of the image frames, and N is the total number of image frames.
4. The method according to claim 1, wherein the mapping of the multi-view event stream and image frames to the reference camera position described in step 2 is specifically:
[x_r, y_r, 1]^T = K R K^(-1) [x, y, 1]^T + K T / d    (1)
in formula (1), x_r, y_r are the mapped image coordinates, R and T are the rotation and translation matrices from the pixel's viewpoint to the reference camera position, x and y are the original image coordinates, K is the camera intrinsic matrix, and d is the focusing distance;
from the mapping formula (1), the refocused event stream data set E_r can be obtained:
E_r = { e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r) }, k ∈ [1, K]
where e_k^r is the k-th event point after refocusing, x_k^r and y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity, a polarity of 1 indicating a brightness increase and a polarity of -1 a brightness decrease, and t_k^r is the timestamp of the event point; from the mapping formula (1), the refocused image frame data set F_r can be obtained:
F_r = { I_n^r ∈ R^(W×H) }, n ∈ [1, N]
where I_n^r is the n-th refocused image frame.
5. The method according to claim 4, wherein the frame compression of the refocused event stream described in step 3 is expressed as:
I_j^{E,r}(x, y) = Σ_{t_k^r ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k^r · δ(x - x_k^r, y - y_k^r),  j ∈ [1, J]    (2)
in formula (2), I_j^{E,r} is the event frame after the j-th frame compression, J is the total number of event frames, x and y are image coordinates, x_k^r, y_k^r are the image coordinates of the k-th refocused event point, δ(·) is the Dirichlet function, and Δt is the time length used for compressing a single event frame, calculated as:
Δt = (t_K - t_1) / J
where t_K is the timestamp of the last (K-th) event point and t_1 is the timestamp of the first event point; the frame-compressed refocused event frame data set is thus obtained:
F_{E,r} = { I_j^{E,r} }, j ∈ [1, J]
6. the method of claim 5, wherein the method comprises the steps of: the event stream pre-reconstruction and refocusing process described in step 3 is represented as;
Figure FDA0003607164130000033
in the formula (3), the first and second groups,
Figure FDA0003607164130000034
denoted as the jth pre-reconstructed event frame, J being the total number of pre-reconstructed event frames, x, y here representing the image coordinates, x k ,y k Delta is a Dirichlet function, Recon () is expressed as an event stream brightness reconstruction operator, and delta t is expressed as the time length adopted by a single event frame to press a frame; next, for each image, using the mapping equation (1) described above
Figure FDA0003607164130000035
Mapping is carried out aiming at the position of a reference camera to obtain a re-focused pre-reconstruction event frame data set F E→F,r
Figure FDA0003607164130000036
Wherein the content of the first and second substances,
Figure FDA0003607164130000037
the refocused pre-reconstruction event frame for the jth picture.
7. The method according to claim 6, wherein the hybrid neural network model takes as input the frame-compressed refocused event frames F_{E,r}, the refocused image frames F_r and the refocused pre-reconstructed event frames F_{E→F,r}, and the three branches are first fed into the multi-modal coding module for coarse feature extraction:
f_{F,0}, f_{E,0}, f_{E→F,0} = MF-Encoder(F_r, F_{E,r}, F_{E→F,r})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the coarse features extracted from F_r, F_{E,r}, F_{E→F,r} respectively, and MF-Encoder(·) is the multi-modal coding operation, which extracts features from each of the three signals; for the F_r and F_{E→F,r} branches, features are extracted with a three-layer convolutional architecture containing skip connections; for the F_{E,r} branch, features are extracted with three spiking layers containing skip connections;
next, the features are enhanced with the cross-attention enhancement module:
f_{F,M}, f_{E,M}, f_{E→F,M} = CME(f_{F,0}, f_{E,0}, f_{E→F,0})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the three-branch features output by the multi-modal coding module, f_{F,M}, f_{E,M}, f_{E→F,M} are the three-branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, in which M cross-attention Transformer modules enhance the features repeatedly; the process is expressed as:
f_{F,m}, f_{E,m}, f_{E→F,m} = C-Transformer(f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1}), m ∈ [1, M]
where f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1} are the three-branch features input to the m-th cross-attention Transformer module and f_{F,m}, f_{E,m}, f_{E→F,m} are the output three-branch features; C-Transformer(·) denotes the feature enhancement operation of a cross-attention Transformer: each signal is first normalized and its own attention information is computed, cross-modal information enhancement is achieved by adding and fusing the attention information of the three branches, features are then enhanced with a multilayer perceptron, and the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m} are output;
next, the three-branch features are fused with weights by the density perception module:
f_ALL = DAF(f_{F,M}, f_{E,M}, f_{E→F,M}, F_r, F_{E,r}, F_{E→F,r})
where f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced features output by the cross-attention enhancement module, F_r, F_{E,r}, F_{E→F,r} are the three original input signals, f_ALL is the fused output feature, and DAF(·) is the density perception fusion operation: each input feature is first concatenated with its corresponding original input along the feature dimension and fed into a single convolutional layer for feature extraction, then global average pooling and global max pooling extract the mean and maximum of each of the three branch features, which are fed into a fully connected layer to compute the weights, and finally the three branch signals are summed with these weights and the fused feature f_ALL is output;
finally, the features are decoded with the multi-modal decoding module and the final brightness reconstruction image is output:
I_recon = MF-Decode(f_ALL)
where f_ALL is the fused feature output by the density perception module, I_recon is the reconstructed brightness image, and MF-Decode(·) is the multi-modal feature decoding operation.
8. The method according to claim 1, wherein the network learning loss function in step 3 is defined as:
L = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)    (4)
in formula (4), I_recon is the brightness image reconstructed by the hybrid neural network model, I_gt is the ground-truth image corresponding to the target image in the data set, L_per is the perceptual loss, L_L1 is the L1-norm loss, L_tv is the total variation loss, and β_per, β_L1, β_tv are the weights of the corresponding losses.
9. The method according to claim 7, wherein the convolution kernel sizes of the three convolutional layers are 3, 5 and 7 and the numbers of convolution kernels are 8, 16 and 32, and the spiking kernel sizes of the three spiking layers are 1, 3 and 7 and the numbers of spiking kernels are 8, 16 and 32.
10. The method according to claim 7, wherein the Transformer module is an existing Swin-Transformer architecture, and the multi-modal feature decoder is an existing ResNet architecture.
CN202210422694.2A 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera Active CN114862732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210422694.2A CN114862732B (en) 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210422694.2A CN114862732B (en) 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera

Publications (2)

Publication Number Publication Date
CN114862732A true CN114862732A (en) 2022-08-05
CN114862732B CN114862732B (en) 2024-04-26

Family

ID=82630677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210422694.2A Active CN114862732B (en) 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera

Country Status (1)

Country Link
CN (1) CN114862732B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578295A (en) * 2022-11-17 2023-01-06 中国科学技术大学 Video rain removing method, system, equipment and storage medium
CN115761472A (en) * 2023-01-09 2023-03-07 吉林大学 Underwater dim light scene reconstruction method based on fusion event and RGB data
CN116310408A (en) * 2022-11-29 2023-06-23 北京大学 Method and device for establishing data association between event camera and frame camera
CN117745596A (en) * 2024-02-19 2024-03-22 吉林大学 Cross-modal fusion-based underwater de-blocking method
CN117939309A (en) * 2024-03-25 2024-04-26 荣耀终端有限公司 Image demosaicing method, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200234414A1 (en) * 2019-01-23 2020-07-23 Inception Institute of Artificial Intelligence, Ltd. Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures
US20200265590A1 (en) * 2019-02-19 2020-08-20 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN111667442A (en) * 2020-05-21 2020-09-15 武汉大学 High-quality high-frame-rate image reconstruction method based on event camera
CN112987026A (en) * 2021-03-05 2021-06-18 武汉大学 Event field synthetic aperture imaging algorithm based on hybrid neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200234414A1 (en) * 2019-01-23 2020-07-23 Inception Institute of Artificial Intelligence, Ltd. Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures
US20200265590A1 (en) * 2019-02-19 2020-08-20 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN111667442A (en) * 2020-05-21 2020-09-15 武汉大学 High-quality high-frame-rate image reconstruction method based on event camera
CN112987026A (en) * 2021-03-05 2021-06-18 武汉大学 Event field synthetic aperture imaging algorithm based on hybrid neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张翔 et al.: "Event-based Synthetic Aperture Imaging with a Hybrid Network", IEEE, 2 November 2021 (2021-11-02), pages 14235-14242 *
项祎祎 et al.: "Synthetic aperture imaging method based on confocal illumination" (基于共焦照明的合成孔径成像方法), Acta Optica Sinica (光学学报), vol. 40, no. 08, 30 April 2020 (2020-04-30), pages 73-79 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578295A (en) * 2022-11-17 2023-01-06 中国科学技术大学 Video rain removing method, system, equipment and storage medium
CN116310408A (en) * 2022-11-29 2023-06-23 北京大学 Method and device for establishing data association between event camera and frame camera
CN116310408B (en) * 2022-11-29 2023-10-13 北京大学 Method and device for establishing data association between event camera and frame camera
CN115761472A (en) * 2023-01-09 2023-03-07 吉林大学 Underwater dim light scene reconstruction method based on fusion event and RGB data
CN117745596A (en) * 2024-02-19 2024-03-22 吉林大学 Cross-modal fusion-based underwater de-blocking method
CN117745596B (en) * 2024-02-19 2024-06-11 吉林大学 Cross-modal fusion-based underwater de-blocking method
CN117939309A (en) * 2024-03-25 2024-04-26 荣耀终端有限公司 Image demosaicing method, electronic device and storage medium

Also Published As

Publication number Publication date
CN114862732B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN114862732A (en) Synthetic aperture imaging method fusing event camera and traditional optical camera
Lim et al. DSLR: Deep stacked Laplacian restorer for low-light image enhancement
EP4198875A1 (en) Image fusion method, and training method and apparatus for image fusion model
CN112653899B (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
Rivadeneira et al. Thermal Image Super-resolution: A Novel Architecture and Dataset.
EP2979449B1 (en) Enhancing motion pictures with accurate motion information
CN111091503A (en) Image out-of-focus blur removing method based on deep learning
Xiang et al. Learning super-resolution reconstruction for high temporal resolution spike stream
Haoyu et al. Learning to deblur and generate high frame rate video with an event camera
CN109120931A (en) A kind of streaming media video compression method based on frame-to-frame correlation
CN112446835A (en) Image recovery method, image recovery network training method, device and storage medium
Yang et al. Learning event guided high dynamic range video reconstruction
Xiao et al. Multi-scale attention generative adversarial networks for video frame interpolation
CN115082341A (en) Low-light image enhancement method based on event camera
Shaw et al. Hdr reconstruction from bracketed exposures and events
CN114651270A (en) Depth loop filtering by time-deformable convolution
CN111681236B (en) Target density estimation method with attention mechanism
Xiong et al. Hierarchical fusion for practical ghost-free high dynamic range imaging
CN116843551A (en) Image processing method and device, electronic equipment and storage medium
CN111353982A (en) Depth camera image sequence screening method and device
CN112819742B (en) Event field synthetic aperture imaging method based on convolutional neural network
Sehli et al. WeLDCFNet: Convolutional Neural Network based on Wedgelet Filters and Learnt Deep Correlation Features for depth maps features extraction
WO2023133888A1 (en) Image processing method and apparatus, remote control device, system, and storage medium
CN115661452A (en) Image de-occlusion method based on event camera and RGB image
CN115147317A (en) Point cloud color quality enhancement method and system based on convolutional neural network

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant