CN114862732A - Synthetic aperture imaging method fusing event camera and traditional optical camera - Google Patents

Synthetic aperture imaging method fusing event camera and traditional optical camera Download PDF

Info

Publication number
CN114862732A
CN114862732A (application CN202210422694.2A)
Authority
CN
China
Prior art keywords
event
image
frame
camera
features
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210422694.2A
Other languages
Chinese (zh)
Other versions
CN114862732B (en
Inventor
余磊
廖伟
张翔
王阳光
杨文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210422694.2A priority Critical patent/CN114862732B/en
Publication of CN114862732A publication Critical patent/CN114862732A/en
Application granted granted Critical
Publication of CN114862732B publication Critical patent/CN114862732B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Images

Classifications

    • G: PHYSICS
      • G06: COMPUTING; CALCULATING OR COUNTING
        • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
          • G06T 5/00: Image enhancement or restoration
            • G06T 5/50: Image enhancement or restoration using two or more images, e.g. averaging or subtraction
          • G06T 2207/00: Indexing scheme for image analysis or image enhancement
            • G06T 2207/20: Special algorithmic details
              • G06T 2207/20081: Training; Learning
              • G06T 2207/20084: Artificial neural networks [ANN]
              • G06T 2207/20212: Image combination
                • G06T 2207/20221: Image fusion; Image merging
        • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
          • G06N 3/00: Computing arrangements based on biological models
            • G06N 3/02: Neural networks
              • G06N 3/04: Architecture, e.g. interconnection topology
                • G06N 3/045: Combinations of networks
              • G06N 3/08: Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a synthetic aperture imaging method that fuses an event camera with a traditional optical camera, combining the advantages of the two cameras for synthetic aperture imaging. By building a neural network architecture based on a spiking neural network and a convolutional neural network, the method constructs a bridge between the event stream and the image frames, reconstructs a high-quality occlusion-free image of the target, and accomplishes high-quality see-through imaging in occlusion scenes of various densities. By jointly exploiting the strengths of the event camera and the traditional optical camera together with the strong learning capability of neural networks, the invention achieves high-quality image reconstruction of occluded targets in occlusion scenes of various densities, thereby improving the robustness and applicability of synthetic aperture imaging.

Description

Synthetic aperture imaging method fusing event camera and traditional optical camera
Technical Field
The invention belongs to the field of image processing, and particularly relates to a synthetic aperture imaging method for fusing an event camera and a traditional optical camera.
Background
Synthetic aperture imaging (SAI) uses a camera to observe a scene from multiple viewpoints, which is equivalent to imaging with a virtual camera of very large aperture. The larger the aperture, the smaller the depth of field; therefore, when the target being photographed is occluded, synthetic aperture imaging can image the occluded target by blurring out the occluder. The technique consequently has very high application value in three-dimensional reconstruction, target tracking, recognition and related fields.
Current synthetic aperture imaging algorithms mainly perform see-through imaging with multi-view image-frame sequences captured by traditional optical cameras. However, as the occlusion becomes denser, the amount of target light information contained in the image frames drops sharply while interfering light from the occluder becomes dominant, so the performance of frame-based synthetic aperture imaging degrades severely. In recent years, synthetic aperture imaging algorithms based on event cameras have been proposed to address see-through imaging in densely occluded scenes. An event camera senses per-pixel brightness changes in the logarithmic domain and asynchronously outputs an event stream with high temporal resolution and high dynamic range, so it can sense the target almost continuously and still gather sufficient target information in densely occluded scenes. However, because existing event-camera synthetic aperture imaging algorithms form images from event points generated by the brightness difference between the target and the occluder, their performance degrades in sparsely occluded scenes where the number of effective event points decreases. Since frame-based and event-based synthetic aperture imaging perform well in sparsely and densely occluded scenes respectively, the information from both can be fully exploited for fused imaging, enabling synthetic aperture imaging under occlusions of various densities. However, because the data modality of an event stream is completely different from that of image frames, establishing a bridge between the two for fused imaging remains difficult.
Disclosure of Invention
In view of the above problems, the invention provides a synthetic aperture imaging method based on an event camera and a traditional optical camera. The method combines the advantages of the two cameras for synthetic aperture imaging: by building a neural network architecture based on a spiking neural network and a convolutional neural network, it constructs a bridge between the event stream and the image frames and reconstructs a high-quality occlusion-free image of the target, accomplishing high-quality see-through imaging in occlusion scenes of various densities.
The invention provides a synthetic aperture imaging method based on an event camera and a traditional optical camera, which comprises the following specific steps:
step 1, constructing a multi-view event stream and an image frame data set in an occluded scene, and constructing an image frame data set in an unoccluded scene.
Step 2, selecting a reference camera position, mapping the multi-view event stream and the image frame to the reference camera position according to a multi-view geometric principle, and refocusing the shielded target to obtain a refocused event stream and image frame data set;
Step 3, constructing a hybrid neural network model, compressing the refocused event stream into event frames, performing pre-reconstruction on the unfocused event stream to obtain pre-reconstructed event frames and then refocusing them, inputting the refocused event frames, the refocused pre-reconstructed event frames and the refocused image frame data set into the hybrid neural network as the training set to obtain the target reconstruction image predicted by the network, and, combining the occlusion-free image of the target with the network learning loss function, iteratively optimizing the network parameters with the Adam optimizer to obtain the trained hybrid neural network;
The hybrid neural network model in step 3 comprises the following modules: a multi-modal coding module, a cross-attention enhancement module, a density perception module and a multi-modal decoding module;
the multi-modal coding module comprises several convolutional layers or spiking layers and is used for feature extraction;
the cross-attention enhancement module comprises several cross-attention Transformer modules and is used for enhancing the extracted features multiple times;
the density perception module concatenates the enhanced features with the original features, then performs a weighted summation of the features through a convolutional layer, a global average pooling layer, a global max pooling layer and a fully connected layer, and outputs the fused features;
the multi-modal decoding module is an existing convolutional neural network architecture and is used for reconstructing the image from the fused features;
Step 4, inputting the event stream and image frames of the occluded target to be reconstructed into the trained network for prediction to obtain the de-occluded reconstructed image corresponding to the occluded target.
Further, the multi-view event stream data set E in step 1 is:
E = { e_k = (x_k, y_k, p_k, t_k) }, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, -1}
where e_k is the k-th event point, x_k, y_k are the pixel coordinates of the k-th event point, p_k is the event polarity (polarity 1 indicates a brightness increase, polarity -1 a brightness decrease), t_k is the timestamp of the event point, W and H are the width and height of the event-point spatial coordinates, and K is the number of event points.
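For concreteness, the following minimal sketch shows one way such an event stream could be held in memory as a NumPy structured array; the field names and dtypes are illustrative choices, not prescribed by the invention.

```python
import numpy as np

# Illustrative container for an event stream E = {e_k = (x_k, y_k, p_k, t_k)}.
event_dtype = np.dtype([
    ("x", np.uint16),   # pixel column, x_k in [1, W]
    ("y", np.uint16),   # pixel row,    y_k in [1, H]
    ("p", np.int8),     # polarity p_k: +1 brightness increase, -1 decrease
    ("t", np.float64),  # timestamp t_k in seconds
])

def make_event_stream(xs, ys, ps, ts):
    """Pack parallel arrays into a single structured event array, sorted by time."""
    events = np.empty(len(xs), dtype=event_dtype)
    events["x"], events["y"], events["p"], events["t"] = xs, ys, ps, ts
    return events[np.argsort(events["t"])]
```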
Further, the multi-view image frame data set F in step 1 is:
F = { I_n ∈ R^(W×H) }, n ∈ [1, N]
where I_n is the n-th image frame, W and H are the width and height of the image frames, and N is the total number of image frames.
Further, the mapping of the multi-view event stream and image frames to the reference camera position in step 2 is specifically as follows:
[x_r, y_r, 1]^T = K R K^(-1) [x, y, 1]^T + K T / d    (1)
In formula (1), x_r, y_r are the mapped image coordinates, x and y are the original image coordinates, R and T are the rotation and translation matrices from the pixel's viewpoint to the reference camera position, K is the camera intrinsic matrix, and d is the refocusing distance, generally set to the distance of the occluded target from the camera plane. From this mapping formula, the refocused event stream data set E_r can be obtained:
E_r = { e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r) }, k ∈ [1, K]
where e_k^r is the k-th event point after refocusing, x_k^r and y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (polarity 1 indicates a brightness increase, polarity -1 a brightness decrease), and t_k^r is the timestamp of the event point. From the mapping formula (1), the refocused image frame data set F_r can likewise be obtained:
F_r = { I_n^r ∈ R^(W×H) }, n ∈ [1, N]
where I_n^r is the n-th refocused image frame.
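The sketch below illustrates the refocusing warp of formula (1) applied to a batch of pixel coordinates; the calibration quantities K, R, T and the refocusing distance d are assumed to be known, and the homogeneous normalization and any sub-pixel rounding are implementation choices not fixed by the text.

```python
import numpy as np

def refocus_points(x, y, K, R, T, d):
    """Map pixel coordinates to the reference view via formula (1):
    [x_r, y_r, 1]^T = K R K^(-1) [x, y, 1]^T + K T / d.
    x, y: 1-D coordinate arrays; K: 3x3 intrinsics; R: 3x3 rotation;
    T: length-3 translation to the reference camera; d: refocusing depth."""
    pts = np.stack([x, y, np.ones_like(x, dtype=np.float64)])          # 3 x N homogeneous
    warped = K @ R @ np.linalg.inv(K) @ pts + (K @ np.asarray(T).reshape(3, 1)) / d
    warped /= warped[2:3]          # normalize; a no-op when the third component is already 1
    return warped[0], warped[1]    # x_r, y_r (sub-pixel; round or splat as needed)
```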
Further, the frame compression of the refocused event stream described in step 3 is expressed as:
I_j^{E,r}(x, y) = Σ_{t_k^r ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k^r · δ(x - x_k^r, y - y_k^r),  j ∈ [1, J]    (2)
In formula (2), I_j^{E,r} is the event frame after the j-th frame compression, J is the total number of event frames, x and y are image coordinates, x_k^r, y_k^r are the image coordinates of the k-th refocused event point, δ(·) is the Dirichlet function, and Δt is the time length used for compressing a single event frame, calculated as:
Δt = (t_K - t_1) / J
where t_K is the timestamp of the last (K-th) event point and t_1 is the timestamp of the first event point. The frame-compressed refocused event frame data set F_{E,r} is thus obtained:
F_{E,r} = { I_j^{E,r} }, j ∈ [1, J]
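A minimal sketch of this frame-compression step is given below: refocused events are binned into J equal time windows and their signed polarities are accumulated per pixel. The signed accumulation and the nearest-pixel rounding are assumptions of the sketch.

```python
import numpy as np

def events_to_frames(events, W, H, J):
    """Accumulate a refocused event stream into J event frames (in the spirit of formula (2)).
    events: structured array with fields x, y, p, t, sorted by timestamp.
    Returns a (J, H, W) float array of signed polarity counts."""
    t_first, t_last = events["t"][0], events["t"][-1]
    dt = (t_last - t_first) / J                                    # time span of one event frame
    frames = np.zeros((J, H, W), dtype=np.float32)
    j = np.minimum(((events["t"] - t_first) / dt).astype(int), J - 1)   # bin index per event
    xs = np.clip(np.round(events["x"].astype(np.float64)).astype(int), 0, W - 1)
    ys = np.clip(np.round(events["y"].astype(np.float64)).astype(int), 0, H - 1)
    np.add.at(frames, (j, ys, xs), events["p"].astype(np.float32))      # signed accumulation
    return frames
```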
Further, the event stream pre-reconstruction described in step 3 is expressed as:
I_j^{E→F}(x, y) = Recon( Σ_{t_k ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k · δ(x - x_k, y - y_k) ),  j ∈ [1, J]    (3)
In formula (3), I_j^{E→F} is the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, x and y are image coordinates, x_k, y_k are the image coordinates of the k-th event point, δ(·) is the Dirichlet function, Recon(·) is an event-stream brightness reconstruction operator (a mainstream event-stream reconstruction algorithm is generally used), and Δt is the time length used for compressing a single event frame. Next, each image I_j^{E→F} is mapped to the reference camera position using the mapping formula (1) above, yielding the refocused pre-reconstructed event frame data set F_{E→F,r}:
F_{E→F,r} = { I_j^{E→F,r} }, j ∈ [1, J]
where I_j^{E→F,r} is the j-th refocused pre-reconstructed event frame.
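The sketch below shows how the pre-reconstruction step might be wrapped around a generic brightness-reconstruction operator; recon_operator is a placeholder callable standing in for Recon(·) (e.g. an E2VID-style network) and is not a specific library API. Refocusing of the resulting frames with formula (1) would follow as a separate warping step.

```python
import numpy as np

def prereconstruct_frames(events, W, H, J, recon_operator):
    """Split the raw (unfocused) event stream into J time windows and hand each
    window to a brightness-reconstruction operator Recon(.) (formula (3)).
    recon_operator: callable mapping (event chunk, W, H) -> (H, W) intensity image."""
    t_first, t_last = events["t"][0], events["t"][-1]
    dt = (t_last - t_first) / J
    out = np.zeros((J, H, W), dtype=np.float32)
    for j in range(J):
        lo, hi = t_first + j * dt, t_first + (j + 1) * dt
        mask = (events["t"] >= lo) & (events["t"] < hi) if j < J - 1 else (events["t"] >= lo)
        out[j] = recon_operator(events[mask], W, H)      # j-th pre-reconstructed event frame
    return out
```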
Further, the hybrid neural network model in step 3 comprises the following modules: a multi-modal coding module, a cross-attention enhancement module, a density perception module and a multi-modal decoding module. The network takes as input the frame-compressed refocused event frames F_{E,r}, the refocused image frames F_r and the refocused pre-reconstructed event frames F_{E→F,r}. The three branches are first fed into the multi-modal coding module for coarse feature extraction:
f_{F,0}, f_{E,0}, f_{E→F,0} = MF-Encoder(F_r, F_{E,r}, F_{E→F,r})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the coarse features extracted from F_r, F_{E,r}, F_{E→F,r} respectively, and MF-Encoder(·) is the multi-modal coding operation, which extracts features from each of the three signals. For the F_r and F_{E→F,r} branches, features are extracted with a three-layer convolutional structure containing skip connections. For the F_{E,r} branch, features are extracted with three spiking layers containing skip connections.
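A minimal PyTorch sketch of this three-branch coarse encoder is given below. The channel widths, kernel sizes and the 1x1-convolution form of the skip connections are illustrative assumptions, and the spiking branch is stubbed with an ordinary convolutional branch because a true spiking layer (e.g. from an SNN library) is outside the scope of the sketch.

```python
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """Three conv layers with skip connections, as in the image-frame and
    pre-reconstructed-event-frame branches of MF-Encoder (a sketch)."""
    def __init__(self, in_ch, chs=(8, 16, 32), ks=(3, 5, 7)):
        super().__init__()
        self.blocks = nn.ModuleList()
        c_prev = in_ch
        for c, k in zip(chs, ks):
            self.blocks.append(nn.Sequential(
                nn.Conv2d(c_prev, c, k, padding=k // 2), nn.ReLU(inplace=True)))
            c_prev = c
        # 1x1 convs realize the skip connections when channel counts differ
        self.skips = nn.ModuleList(
            [nn.Conv2d(cin, cout, 1) for cin, cout in zip((in_ch,) + chs[:-1], chs)])

    def forward(self, x):
        for block, skip in zip(self.blocks, self.skips):
            x = block(x) + skip(x)      # skip (jump) connection
        return x

class MFEncoder(nn.Module):
    """Coarse feature extraction for the three inputs F_r, F_{E,r}, F_{E->F,r}.
    The spiking branch is stubbed with a ConvBranch; a real implementation would
    use spiking layers (e.g. LIF neurons) as the invention describes."""
    def __init__(self, ch_frame=1, ch_event=1, ch_pre=1):
        super().__init__()
        self.frame_branch = ConvBranch(ch_frame)
        self.event_branch = ConvBranch(ch_event)   # placeholder for the SNN branch
        self.pre_branch = ConvBranch(ch_pre)

    def forward(self, F_r, F_Er, F_EFr):
        return self.frame_branch(F_r), self.event_branch(F_Er), self.pre_branch(F_EFr)
```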
Next, the features are enhanced with the cross-attention enhancement module:
f_{F,M}, f_{E,M}, f_{E→F,M} = CME(f_{F,0}, f_{E,0}, f_{E→F,0})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the three-branch features output by the multi-modal coding module in the previous step, f_{F,M}, f_{E,M}, f_{E→F,M} are the three-branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, in which M cross-attention Transformer modules enhance the features repeatedly. The process is expressed as:
f_{F,m}, f_{E,m}, f_{E→F,m} = C-Transformer(f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1}), m ∈ [1, M]
where f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1} are the three-branch features input to the m-th cross-attention Transformer module and f_{F,m}, f_{E,m}, f_{E→F,m} are the output three-branch features. C-Transformer(·) denotes the feature enhancement operation of a cross-attention Transformer: each signal is first normalized and its own attention information is computed; cross-modal information enhancement is achieved by adding and fusing the attention information of the three branches; features are then enhanced with a multilayer perceptron, and the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m} are output.
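The following simplified PyTorch sketch captures one C-Transformer step: each branch is normalized, per-branch attention is computed, the three attention outputs are summed for cross-modal fusion, and an MLP refines each branch. It flattens the spatial dimensions into tokens and uses standard multi-head attention rather than the Swin-style windowed attention of the embodiment, so it is an approximation.

```python
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    """One C-Transformer step (sketch): per-branch attention on normalized inputs,
    additive fusion of the three attention maps, then an MLP refinement per branch."""
    def __init__(self, dim, heads=4):          # dim must be divisible by heads
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(3)])
        self.attns = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(3)])
        self.mlps = nn.ModuleList([nn.Sequential(
            nn.LayerNorm(dim), nn.Linear(dim, dim * 2), nn.GELU(),
            nn.Linear(dim * 2, dim)) for _ in range(3)])

    def forward(self, feats):                  # feats: list of 3 tensors (B, N, dim)
        attn_outs = []
        for x, norm, attn in zip(feats, self.norms, self.attns):
            h = norm(x)
            attn_outs.append(attn(h, h, h, need_weights=False)[0])
        fused = sum(attn_outs)                 # cross-modal additive fusion
        return [x + fused + mlp(x + fused) for x, mlp in zip(feats, self.mlps)]
```

In use, each (B, C, H, W) feature map would be flattened to (B, H*W, C) tokens before entering the block and reshaped back afterwards.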
Next, the three-branch features are fused with weights by the density perception module:
f_ALL = DAF(f_{F,M}, f_{E,M}, f_{E→F,M}, F_r, F_{E,r}, F_{E→F,r})
where f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced features output in the previous step, F_r, F_{E,r}, F_{E→F,r} are the three original signals input to the network, f_ALL is the fused output feature, and DAF(·) is the density perception fusion operation: each input feature is first concatenated with its corresponding original input along the feature dimension and fed into a single convolutional layer for feature extraction; then global average pooling and global max pooling extract the mean and maximum of each of the three branch features, which are fed into a fully connected layer to compute the weights; finally, the three branch signals are summed with these weights and the fused feature f_ALL is output.
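A minimal PyTorch sketch of this density perception fusion is given below; the softmax normalization of the branch weights and the channel widths are assumptions of the sketch rather than details fixed by the text.

```python
import torch
import torch.nn as nn

class DensityAwareFusion(nn.Module):
    """Sketch of DAF: concatenate each enhanced feature with its raw input,
    extract features with a single conv layer, pool (average + max) to a
    descriptor, predict per-branch weights with a fully connected layer,
    and return the weighted sum of the three branches."""
    def __init__(self, feat_ch, raw_ch, mid_ch=32):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(feat_ch + raw_ch, mid_ch, 3, padding=1) for _ in range(3)])
        self.fc = nn.Linear(3 * 2 * mid_ch, 3)   # avg+max descriptors of 3 branches -> 3 weights

    def forward(self, feats, raws):              # lists of 3 tensors, (B, C, H, W) each
        branch_feats, descriptors = [], []
        for conv, f, r in zip(self.convs, feats, raws):
            h = conv(torch.cat([f, r], dim=1))   # feature concatenation + single conv layer
            branch_feats.append(h)
            gap = h.mean(dim=(2, 3))             # global average pooling
            gmp = h.amax(dim=(2, 3))             # global max pooling
            descriptors.append(torch.cat([gap, gmp], dim=1))
        w = torch.softmax(self.fc(torch.cat(descriptors, dim=1)), dim=1)   # (B, 3) weights
        return sum(w[:, i, None, None, None] * branch_feats[i] for i in range(3))
```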
Finally, the features are decoded with the multi-modal decoding module and the final brightness reconstruction image is output:
I_recon = MF-Decode(f_ALL)
where f_ALL is the fused feature output by the previous module, I_recon is the reconstructed brightness image, and MF-Decode(·) is the multi-modal feature decoding operation, which generally uses a mainstream convolutional neural network architecture.
Further, the network learning loss function L in step 3 is defined as:
L = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)    (4)
In formula (4), I_recon is the brightness image reconstructed by the network and I_gt is the ground-truth image corresponding to the target image in the data set; L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss, all of which are commonly used known functions; β_per, β_L1, β_tv are the weights of the corresponding losses.
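A possible PyTorch sketch of this loss is shown below; the use of VGG16 features for the perceptual term, the particular feature layer, and the omission of input normalization are assumptions of the sketch.

```python
import torch
import torch.nn.functional as F
import torchvision

class ReconstructionLoss(torch.nn.Module):
    """Sketch of formula (4): L = b_per*L_per + b_L1*L_L1 + b_tv*L_tv."""
    def __init__(self, b_per=1.0, b_l1=32.0, b_tv=2e-4):
        super().__init__()
        # Frozen VGG16 features as the perceptual-loss backbone (an assumed choice).
        vgg = torchvision.models.vgg16(
            weights=torchvision.models.VGG16_Weights.DEFAULT).features[:16].eval()
        for p in vgg.parameters():
            p.requires_grad_(False)
        self.vgg, self.b_per, self.b_l1, self.b_tv = vgg, b_per, b_l1, b_tv

    def forward(self, recon, gt):               # (B, 1, H, W) images in [0, 1]
        r3, g3 = recon.repeat(1, 3, 1, 1), gt.repeat(1, 3, 1, 1)   # VGG expects 3 channels
        loss_per = F.l1_loss(self.vgg(r3), self.vgg(g3))           # perceptual loss
        loss_l1 = F.l1_loss(recon, gt)                             # L1-norm loss
        loss_tv = (recon[..., 1:, :] - recon[..., :-1, :]).abs().mean() + \
                  (recon[..., :, 1:] - recon[..., :, :-1]).abs().mean()   # total variation
        return self.b_per * loss_per + self.b_l1 * loss_l1 + self.b_tv * loss_tv
```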
Further, the event stream and image frame data input in step 4 need to be subjected to the same refocusing process as that in step 2, then subjected to the same event stream preprocessing process as that in step 3, and then input into the trained neural network to obtain a corresponding target reconstruction image.
Compared with the prior art, the invention has the advantages that:
the invention provides a synthetic aperture imaging method fusing an event camera and a traditional optical camera, which comprehensively utilizes the advantages of the event camera and the traditional optical camera and uses the strong learning capability of a neural network, thereby realizing the high-quality image reconstruction of a target in various dense occlusion scenes and further enhancing the robustness and the applicability of the synthetic aperture imaging technology.
Drawings
FIG. 1 is a schematic view of an experimental scene including a shielded target, a shield, a programmable slide rail, and a camera sensor.
FIG. 2 is an overall flow chart of the method of the present invention.
Fig. 3 is a schematic diagram of a process of acquiring data by moving a camera.
Fig. 4 is a comparison graph of image frames, event frames and pre-reconstruction event frames after data preprocessing.
Fig. 5 is a schematic diagram of a hybrid neural network structure.
Fig. 6 is a comparison of the present method with different synthetic aperture imaging methods. The first to four rows from top to bottom are the synthetic aperture imaging results of different dense occlusion scenes. From left to right, the first column is a schematic diagram of an occlusion, the second column is an unobstructed reference image, the third column is a synthetic aperture imaging algorithm (F-SAI + ACC) based on a traditional optical camera, the fourth column is a synthetic aperture imaging algorithm (F-SAI + CNN) based on a traditional optical camera and a convolutional neural network, the fifth column is a synthetic aperture imaging algorithm (E-SAI + ACC) based on an event camera and an accumulation method, the sixth column is a synthetic aperture imaging algorithm (E-SAI + Hybrid) based on an event camera and a Hybrid neural network, and the seventh column is an inventive algorithm.
Detailed Description
For a clear understanding of the present invention, the technical content of the invention is described more clearly and completely below with reference to FIG. 1 and an example. Obviously, the described example is only a part of the examples of the present invention, not all of them. All other examples obtained by those skilled in the art based on the examples in the present invention without creative effort fall within the protection scope of the present invention.
The invention relates to a synthetic aperture imaging method for fusing an event camera and a traditional optical camera.
The schematic scenario of the specific implementation of the present invention is shown in fig. 1, and includes: the system comprises a camera sensor, a programmable slide rail, a shelter and a sheltered target;
the camera sensor is a Davis346 event camera, and the camera can synchronously output event streams and image frame data and is used for constructing a corresponding data set;
the Davis346 event camera is mounted on a programmable slide rail that is set to move linearly at a constant speed;
the overall flow chart of the invention is shown in the attached figure 2, and the specific steps are as follows:
step 1, constructing a multi-view event stream and an image frame data set in an occluded scene, and constructing an image frame data set in an unoccluded scene, as shown in fig. 3;
the multi-view event stream data E in step 1 is
E={e (k}) =(x k ,y k ,p k ,t k )},k∈[1,K],x k ∈[1,W],y k ∈[1,H],p k ∈[1,-1]
Wherein e is k For the kth event point data, x k ,y k Pixel coordinates of a kth event point; p is a radical of k Event polarity (polarity 1 represents increased brightness and polarity-1 represents decreased brightness); t is t k A timestamp of the event point; w346, H260 each indicate the width and height of the event point space coordinate, and K indicates the number of event points.
The multi-view image frame data F in step 1 is:
F = { I_n ∈ R^(W×H) }, n ∈ [1, N]
where I_n is the n-th image frame, W = 346 and H = 260 are the width and height of the image frames, and N = 30 is the total number of image frames.
Step 2, selecting a reference camera position, mapping the multi-view event stream and the image frames to the reference camera position according to the multi-view geometry principle, and refocusing on the occluded target to obtain the refocused event stream and image frame data sets;
the manner of mapping the multi-view event stream image frame to the reference camera position described in step 2 is specifically as follows:
[x r ,y r ,1]=KRK -1 [x,y,1]+KT/d (1)
in the formula (1), x r ,y r And (3) representing the mapped image coordinates, R and T represent a rotation matrix and a translation matrix from the pixel point to the position of a reference camera, K represents an internal reference matrix of the camera, and d is a focusing distance which is generally set as the distance from a shielded target to a camera plane. In this example, since the set camera motion mode is a uniform linear motion, it can be considered that the clock does not rotate during the shooting process of the camera, and thus the rotation matrix is a diagonal unit matrix. At time t, the translation matrix calculation mode can be modeled as
T t =[v track (t-t r )0 0]
Wherein v is track 0.0885m/s is the moving speed of the guide rail, t r For each shot, the camera is at a time stamp of the reference position.
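Under this uniform-linear-motion model, with R equal to the identity, the warp of formula (1) reduces to a pure pixel shift K·T_t/d, as the short sketch below illustrates; the intrinsic matrix and refocusing distance used here are placeholder values, not the calibrated Davis346 parameters.

```python
import numpy as np

def translation_at(t, t_ref, v_track=0.0885):
    """Translation (metres) from the camera pose at time t to the reference pose,
    for a camera moving at constant speed v_track along the rail's x-axis."""
    return np.array([v_track * (t - t_ref), 0.0, 0.0])

def refocus_shift(K, T, d):
    """Pixel shift contributed by the K*T/d term of formula (1); with R = I
    (no rotation on the rail) the warp reduces to adding this shift."""
    return (K @ T) / d

# Illustrative values only (not the calibrated Davis346 parameters):
K = np.array([[320.0, 0.0, 173.0], [0.0, 320.0, 130.0], [0.0, 0.0, 1.0]])
shift = refocus_shift(K, translation_at(t=0.3, t_ref=0.0), d=1.5)   # [dx, dy, 0]
```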
From the mapping formula (1), the refocused event stream data E_r can be obtained:
E_r = { e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r) }, k ∈ [1, K]
where e_k^r is the k-th event point after refocusing, x_k^r and y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (polarity 1 indicates a brightness increase, polarity -1 a brightness decrease), and t_k^r is the timestamp of the event point. From the mapping formula (1), the refocused image frame data set F_r can likewise be obtained:
F_r = { I_n^r ∈ R^(W×H) }, n ∈ [1, N]
where I_n^r is the n-th refocused image frame.
Step 3, constructing a hybrid neural network model, compressing the refocused event stream into event frames, performing pre-reconstruction on the unfocused event stream to obtain pre-reconstructed event frames and then refocusing them, inputting the refocused event frames, the refocused pre-reconstructed event frames and the refocused image frame data set into the hybrid neural network as the training set to obtain the network-predicted target reconstruction image, and, combining the occlusion-free image of the target with the network learning loss function, iteratively optimizing the network parameters with the Adam optimizer to obtain the trained hybrid neural network;
the refocusing event streaming frame process described in step 3 is represented as:
Figure BDA0003607164140000078
in the formula (2), the first and second groups,
Figure BDA0003607164140000079
the data of the event frame after the jth compression frame is shown, J is the total number of the event frames, x and y represent image coordinates,
Figure BDA00036071641400000710
for the image coordinates of the k-th refocusing event point, δ (·) is expressed as a Dirichlet function, Δ t tableThe time length adopted by the frame pressing of a single event frame is shown, and the calculation mode is shown as follows:
Figure BDA00036071641400000711
wherein, t k Is the time stamp of the kth event point, t 1 Is the timestamp of the first event point. Therefore, a refocusing event frame data set F after frame pressing can be obtained E,r
Figure BDA0003607164140000081
The event stream pre-reconstruction and refocusing described in step 3 is expressed as:
I_j^{E→F}(x, y) = Recon( Σ_{t_k ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k · δ(x - x_k, y - y_k) ),  j ∈ [1, J]    (3)
In formula (3), I_j^{E→F} is the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, x and y are image coordinates, x_k, y_k are the image coordinates of the k-th event point, δ(·) is the Dirichlet function, Recon(·) is the event-stream brightness reconstruction operator (the existing E2VID algorithm is used in this example), and Δt is the time length used for compressing a single event frame. Next, each image I_j^{E→F} is mapped to the reference camera position using the mapping formula (1) above, yielding the refocused pre-reconstructed event frame data set:
F_{E→F,r} = { I_j^{E→F,r} }, j ∈ [1, J]
where I_j^{E→F,r} is the j-th refocused pre-reconstructed event frame.
Figure 4 shows visually the refocused image frame, the event frame and the pre-reconstructed event frame, respectively.
The structure of the hybrid neural network described in step 3 is shown in FIG. 5. The hybrid neural network model in step 3 comprises the following modules: a multi-modal coding module, a cross-attention enhancement module, a density perception module and a multi-modal decoding module. The network takes as input the frame-compressed refocused event frames F_{E,r}, the refocused image frames F_r and the refocused pre-reconstructed event frames F_{E→F,r}. The three branches are first fed into the multi-modal coding module for coarse feature extraction:
f_{F,0}, f_{E,0}, f_{E→F,0} = MF-Encoder(F_r, F_{E,r}, F_{E→F,r})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the coarse features extracted from F_r, F_{E,r}, F_{E→F,r} respectively, and MF-Encoder(·) is the multi-modal coding operation, which extracts features from each of the three signals. For the F_r and F_{E→F,r} branches, features are extracted with a three-layer convolutional structure containing skip connections; in this embodiment the convolution kernel sizes of the three convolutional layers are 3, 5 and 7, and the numbers of kernels are 8, 16 and 32. For the F_{E,r} branch, features are extracted with three spiking layers containing skip connections; in this embodiment the spiking kernel sizes of the three spiking layers are 1, 3 and 7, and the numbers of kernels are 8, 16 and 32.
Next, the features are enhanced with the cross-attention enhancement module:
f_{F,M}, f_{E,M}, f_{E→F,M} = CME(f_{F,0}, f_{E,0}, f_{E→F,0})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the three-branch features output by the multi-modal coding module in the previous step, f_{F,M}, f_{E,M}, f_{E→F,M} are the three-branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which contains M = 6 cross-attention Transformer modules that enhance the features repeatedly:
f_{F,m}, f_{E,m}, f_{E→F,m} = C-Transformer(f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1}), m ∈ [1, M]
where f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1} are the three-branch features input to the m-th cross-attention Transformer module and f_{F,m}, f_{E,m}, f_{E→F,m} are the output three-branch features. C-Transformer(·) denotes the feature enhancement operation of a cross-attention Transformer: each signal is first normalized and its own attention information is computed; cross-modal information enhancement is achieved by adding and fusing the attention information of the three branches; features are then enhanced with a multilayer perceptron, and the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m} are output. In this embodiment, the Transformer module used is the existing Swin-Transformer architecture.
Next, the three-branch features are fused with weights by the density perception module:
f_ALL = DAF(f_{F,M}, f_{E,M}, f_{E→F,M}, F_r, F_{E,r}, F_{E→F,r})
where f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced features output in the previous step, F_r, F_{E,r}, F_{E→F,r} are the three original signals input to the network, f_ALL is the fused output feature, and DAF(·) is the density perception fusion operation: each input feature is first concatenated with its corresponding original input along the feature dimension and fed into a single convolutional layer for feature extraction (in this embodiment the convolution kernel size is 3 and the number of kernels is 32); then global average pooling and global max pooling extract the mean and maximum of each of the three branch features, which are fed into a fully connected layer to compute the weights; finally, the three branch signals are summed with these weights and the fused feature f_ALL is output.
Finally, the features are decoded with the multi-modal decoding module and the final brightness reconstruction image is output:
I_recon = MF-Decode(f_ALL)
where f_ALL is the fused feature output by the previous module, I_recon is the reconstructed brightness image, and MF-Decode(·) is the multi-modal feature decoding operation; in this embodiment the existing ResNet architecture is used as the multi-modal feature decoder.
The network learning loss function in step 3 is defined as:
L = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)    (4)
In formula (4), I_recon is the brightness image reconstructed by the network and I_gt is the ground-truth image corresponding to the target image in the data set; L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss, all of which are commonly used known functions; β_per, β_L1, β_tv are the weights of the corresponding losses, set in this example to β_per = 1, β_L1 = 32, β_tv = 0.0002.
Step 4, inputting the event stream and image frames of the occluded target to be reconstructed into the trained network for prediction to obtain the de-occluded reconstructed image corresponding to the occluded target.
The event stream and image frame data input in the step 4 need to be subjected to the same refocusing process as that in the step 2, then subjected to the same event stream preprocessing process as that in the step 3, and then input into the trained neural network to obtain a corresponding target reconstruction image.
Figure 6 compares the results of the algorithm of the present invention with other synthetic aperture imaging algorithms in a variety of occluded scenes. The first to four rows from top to bottom are the synthetic aperture imaging results of different dense occlusion scenes. From left to right, the first column is a schematic diagram of an occlusion, the second column is an unobstructed reference image, the third column is a synthetic aperture imaging algorithm (F-SAI + ACC) based on a traditional optical camera, the fourth column is a synthetic aperture imaging algorithm (F-SAI + CNN) based on a traditional optical camera and a convolutional neural network, the fifth column is a synthetic aperture imaging algorithm (E-SAI + ACC) based on an event camera and an accumulation method, the sixth column is a synthetic aperture imaging algorithm (E-SAI + Hybrid) based on an event camera and a Hybrid neural network, and the seventh column is an inventive algorithm.
It should be understood that technical parts not described in detail in the specification belong to the prior art.
It should be understood that the above-described preferred embodiments are illustrative and not restrictive of the scope of the invention, and that various changes, substitutions and alterations can be made herein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A method for synthetic aperture imaging combining an event camera with a conventional optical camera, comprising the steps of:
step 1, constructing a multi-view event stream and an image frame data set in an occluded scene, and constructing an image frame data set in an unoccluded scene;
step 2, selecting a reference camera position, mapping the multi-view event stream and the image frames to the reference camera position according to the multi-view geometry principle, and refocusing on the occluded target to obtain the refocused event stream and image frame data sets;
step 3, constructing a hybrid neural network model, compressing the refocused event stream into event frames, performing pre-reconstruction on the unfocused event stream to obtain pre-reconstructed event frames and then refocusing them, inputting the refocused event frames, the refocused pre-reconstructed event frames and the refocused image frame data set into the hybrid neural network model as the training set to obtain the network-predicted target reconstruction image, and, combining the occlusion-free image of the target with the network learning loss function, iteratively optimizing the network parameters with the Adam optimizer to obtain the trained hybrid neural network model;
the hybrid neural network model in step 3 comprises the following modules: a multi-modal coding module, a cross-attention enhancement module, a density perception module and a multi-modal decoding module;
the multi-modal coding module comprises several convolutional layers or spiking layers and is used for feature extraction;
the cross-attention enhancement module comprises several cross-attention Transformer modules and is used for enhancing the extracted features multiple times;
the density perception module concatenates the enhanced features with the original features, then performs a weighted summation of the features through a convolutional layer, a global average pooling layer, a global max pooling layer and a fully connected layer, and outputs the fused features;
the multi-modal decoding module is an existing convolutional neural network architecture and is used for reconstructing the image from the fused features;
step 4, inputting the event stream and image frames of the occluded target to be reconstructed into the trained hybrid neural network model for prediction to obtain the de-occluded reconstructed image corresponding to the occluded target.
2. The method according to claim 1, wherein the multi-view event stream data set E described in step 1 is expressed as:
E = { e_k = (x_k, y_k, p_k, t_k) }, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, -1}
where e_k is the k-th event point, x_k, y_k are the pixel coordinates of the k-th event point, p_k is the event polarity, a polarity of 1 indicating a brightness increase and a polarity of -1 a brightness decrease, t_k is the timestamp of the event point, W and H are the width and height of the event-point spatial coordinates, and K is the number of event points.
3. The method according to claim 1, wherein the multi-view image frame data set F is expressed as:
F = { I_n ∈ R^(W×H) }, n ∈ [1, N]
where I_n is the n-th image frame, W and H are the width and height of the image frames, and N is the total number of image frames.
4. The method according to claim 1, wherein the mapping of the multi-view event stream and image frames to the reference camera position described in step 2 is specifically:
[x_r, y_r, 1]^T = K R K^(-1) [x, y, 1]^T + K T / d    (1)
in formula (1), x_r, y_r are the mapped image coordinates, R and T are the rotation and translation matrices from the pixel's viewpoint to the reference camera position, x and y are the original image coordinates, K is the camera intrinsic matrix, and d is the focusing distance;
from the mapping formula (1), the refocused event stream data set E_r can be obtained:
E_r = { e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r) }, k ∈ [1, K]
where e_k^r is the k-th event point after refocusing, x_k^r and y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity, a polarity of 1 indicating a brightness increase and a polarity of -1 a brightness decrease, and t_k^r is the timestamp of the event point; from the mapping formula (1), the refocused image frame data set F_r can be obtained:
F_r = { I_n^r ∈ R^(W×H) }, n ∈ [1, N]
where I_n^r is the n-th refocused image frame.
5. The method according to claim 4, wherein the frame compression of the refocused event stream described in step 3 is expressed as:
I_j^{E,r}(x, y) = Σ_{t_k^r ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k^r · δ(x - x_k^r, y - y_k^r),  j ∈ [1, J]    (2)
in formula (2), I_j^{E,r} is the event frame after the j-th frame compression, J is the total number of event frames, x and y are image coordinates, x_k^r, y_k^r are the image coordinates of the k-th refocused event point, δ(·) is the Dirichlet function, and Δt is the time length used for compressing a single event frame, calculated as:
Δt = (t_K - t_1) / J
where t_K is the timestamp of the last (K-th) event point and t_1 is the timestamp of the first event point; the frame-compressed refocused event frame data set is thus obtained:
F_{E,r} = { I_j^{E,r} }, j ∈ [1, J]
6. the method of claim 5, wherein the method comprises the steps of: the event stream pre-reconstruction and refocusing process described in step 3 is represented as;
Figure FDA0003607164130000033
in the formula (3), the first and second groups,
Figure FDA0003607164130000034
denoted as the jth pre-reconstructed event frame, J being the total number of pre-reconstructed event frames, x, y here representing the image coordinates, x k ,y k Delta is a Dirichlet function, Recon () is expressed as an event stream brightness reconstruction operator, and delta t is expressed as the time length adopted by a single event frame to press a frame; next, for each image, using the mapping equation (1) described above
Figure FDA0003607164130000035
Mapping is carried out aiming at the position of a reference camera to obtain a re-focused pre-reconstruction event frame data set F E→F,r
Figure FDA0003607164130000036
Wherein the content of the first and second substances,
Figure FDA0003607164130000037
the refocused pre-reconstruction event frame for the jth picture.
7. The method according to claim 6, wherein the hybrid neural network model takes as input the frame-compressed refocused event frames F_{E,r}, the refocused image frames F_r and the refocused pre-reconstructed event frames F_{E→F,r}, and the three branches are first fed into the multi-modal coding module for coarse feature extraction:
f_{F,0}, f_{E,0}, f_{E→F,0} = MF-Encoder(F_r, F_{E,r}, F_{E→F,r})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the coarse features extracted from F_r, F_{E,r}, F_{E→F,r} respectively, and MF-Encoder(·) is the multi-modal coding operation, which extracts features from each of the three signals; for the F_r and F_{E→F,r} branches, features are extracted with a three-layer convolutional architecture containing skip connections; for the F_{E,r} branch, features are extracted with three spiking layers containing skip connections;
next, the features are enhanced with the cross-attention enhancement module:
f_{F,M}, f_{E,M}, f_{E→F,M} = CME(f_{F,0}, f_{E,0}, f_{E→F,0})
where f_{F,0}, f_{E,0}, f_{E→F,0} are the three-branch features output by the multi-modal coding module, f_{F,M}, f_{E,M}, f_{E→F,M} are the three-branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, in which M cross-attention Transformer modules enhance the features repeatedly; the process is expressed as:
f_{F,m}, f_{E,m}, f_{E→F,m} = C-Transformer(f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1}), m ∈ [1, M]
where f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1} are the three-branch features input to the m-th cross-attention Transformer module and f_{F,m}, f_{E,m}, f_{E→F,m} are the output three-branch features; C-Transformer(·) denotes the feature enhancement operation of a cross-attention Transformer: each signal is first normalized and its own attention information is computed, cross-modal information enhancement is achieved by adding and fusing the attention information of the three branches, features are then enhanced with a multilayer perceptron, and the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m} are output;
next, the three-branch features are fused with weights by the density perception module:
f_ALL = DAF(f_{F,M}, f_{E,M}, f_{E→F,M}, F_r, F_{E,r}, F_{E→F,r})
where f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced features output by the cross-attention enhancement module, F_r, F_{E,r}, F_{E→F,r} are the three original input signals, f_ALL is the fused output feature, and DAF(·) is the density perception fusion operation: each input feature is first concatenated with its corresponding original input along the feature dimension and fed into a single convolutional layer for feature extraction, then global average pooling and global max pooling extract the mean and maximum of each of the three branch features, which are fed into a fully connected layer to compute the weights, and finally the three branch signals are summed with these weights and the fused feature f_ALL is output;
finally, the features are decoded with the multi-modal decoding module and the final brightness reconstruction image is output:
I_recon = MF-Decode(f_ALL)
where f_ALL is the fused feature output by the density perception module, I_recon is the reconstructed brightness image, and MF-Decode(·) is the multi-modal feature decoding operation.
8. The method according to claim 1, wherein the network learning loss function in step 3 is defined as:
L = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)    (4)
in formula (4), I_recon is the brightness image reconstructed by the hybrid neural network model, I_gt is the ground-truth image corresponding to the target image in the data set, L_per is the perceptual loss, L_L1 is the L1-norm loss, L_tv is the total variation loss, and β_per, β_L1, β_tv are the weights of the corresponding losses.
9. The method according to claim 7, wherein the convolution kernel sizes of the three convolutional layers are 3, 5 and 7 and the numbers of convolution kernels are 8, 16 and 32, and the spiking kernel sizes of the three spiking layers are 1, 3 and 7 and the numbers of spiking kernels are 8, 16 and 32.
10. The method according to claim 7, wherein the Transformer module is an existing Swin-Transformer architecture, and the multi-modal feature decoder is an existing ResNet architecture.
CN202210422694.2A 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera Active CN114862732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210422694.2A CN114862732B (en) 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210422694.2A CN114862732B (en) 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera

Publications (2)

Publication Number Publication Date
CN114862732A true CN114862732A (en) 2022-08-05
CN114862732B CN114862732B (en) 2024-04-26

Family

ID=82630677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210422694.2A Active CN114862732B (en) 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera

Country Status (1)

Country Link
CN (1) CN114862732B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578295A (en) * 2022-11-17 2023-01-06 中国科学技术大学 Video rain removing method, system, equipment and storage medium
CN115761472A (en) * 2023-01-09 2023-03-07 吉林大学 Underwater dim light scene reconstruction method based on fusion event and RGB data
CN116310408A (en) * 2022-11-29 2023-06-23 北京大学 Method and device for establishing data association between event camera and frame camera
CN117745596A (en) * 2024-02-19 2024-03-22 吉林大学 Cross-modal fusion-based underwater de-blocking method
CN117939309A (en) * 2024-03-25 2024-04-26 荣耀终端有限公司 Image demosaicing method, electronic device and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200234414A1 (en) * 2019-01-23 2020-07-23 Inception Institute of Artificial Intelligence, Ltd. Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures
US20200265590A1 (en) * 2019-02-19 2020-08-20 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN111667442A (en) * 2020-05-21 2020-09-15 武汉大学 High-quality high-frame-rate image reconstruction method based on event camera
CN112987026A (en) * 2021-03-05 2021-06-18 武汉大学 Event field synthetic aperture imaging algorithm based on hybrid neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200234414A1 (en) * 2019-01-23 2020-07-23 Inception Institute of Artificial Intelligence, Ltd. Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures
US20200265590A1 (en) * 2019-02-19 2020-08-20 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning
CN111667442A (en) * 2020-05-21 2020-09-15 武汉大学 High-quality high-frame-rate image reconstruction method based on event camera
CN112987026A (en) * 2021-03-05 2021-06-18 武汉大学 Event field synthetic aperture imaging algorithm based on hybrid neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
张翔 et al.: "Event-based Synthetic Aperture Imaging with a Hybrid Network", IEEE, 2 November 2021 (2021-11-02), pages 14235-14242 *
项祎祎 et al.: "Synthetic aperture imaging method based on confocal illumination" (基于共焦照明的合成孔径成像方法), Acta Optica Sinica (光学学报), vol. 40, no. 08, 30 April 2020 (2020-04-30), pages 73-79 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578295A (en) * 2022-11-17 2023-01-06 中国科学技术大学 Video rain removing method, system, equipment and storage medium
CN116310408A (en) * 2022-11-29 2023-06-23 北京大学 Method and device for establishing data association between event camera and frame camera
CN116310408B (en) * 2022-11-29 2023-10-13 北京大学 Method and device for establishing data association between event camera and frame camera
CN115761472A (en) * 2023-01-09 2023-03-07 吉林大学 Underwater dim light scene reconstruction method based on fusion event and RGB data
CN117745596A (en) * 2024-02-19 2024-03-22 吉林大学 Cross-modal fusion-based underwater de-blocking method
CN117745596B (en) * 2024-02-19 2024-06-11 吉林大学 Cross-modal fusion-based underwater de-blocking method
CN117939309A (en) * 2024-03-25 2024-04-26 荣耀终端有限公司 Image demosaicing method, electronic device and storage medium

Also Published As

Publication number Publication date
CN114862732B (en) 2024-04-26

Similar Documents

Publication Publication Date Title
CN114862732A (en) Synthetic aperture imaging method fusing event camera and traditional optical camera
Lim et al. DSLR: Deep stacked Laplacian restorer for low-light image enhancement
EP4198875A1 (en) Image fusion method, and training method and apparatus for image fusion model
CN112653899B (en) Network live broadcast video feature extraction method based on joint attention ResNeSt under complex scene
Rivadeneira et al. Thermal Image Super-resolution: A Novel Architecture and Dataset.
EP2979449B1 (en) Enhancing motion pictures with accurate motion information
CN111091503A (en) Image out-of-focus blur removing method based on deep learning
Xiang et al. Learning super-resolution reconstruction for high temporal resolution spike stream
Haoyu et al. Learning to deblur and generate high frame rate video with an event camera
CN109120931A (en) A kind of streaming media video compression method based on frame-to-frame correlation
CN112446835A (en) Image recovery method, image recovery network training method, device and storage medium
Yang et al. Learning event guided high dynamic range video reconstruction
Xiao et al. Multi-scale attention generative adversarial networks for video frame interpolation
CN115082341A (en) Low-light image enhancement method based on event camera
Shaw et al. Hdr reconstruction from bracketed exposures and events
CN114651270A (en) Depth loop filtering by time-deformable convolution
CN111681236B (en) Target density estimation method with attention mechanism
Xiong et al. Hierarchical fusion for practical ghost-free high dynamic range imaging
CN116843551A (en) Image processing method and device, electronic equipment and storage medium
CN111353982A (en) Depth camera image sequence screening method and device
CN112819742B (en) Event field synthetic aperture imaging method based on convolutional neural network
Sehli et al. WeLDCFNet: Convolutional Neural Network based on Wedgelet Filters and Learnt Deep Correlation Features for depth maps features extraction
WO2023133888A1 (en) Image processing method and apparatus, remote control device, system, and storage medium
CN115661452A (en) Image de-occlusion method based on event camera and RGB image
CN115147317A (en) Point cloud color quality enhancement method and system based on convolutional neural network

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
GR01: Patent grant