CN114862732A - Synthetic aperture imaging method fusing event camera and traditional optical camera - Google Patents
- Publication number
- CN114862732A (application CN202210422694.2A)
- Authority
- CN
- China
- Prior art keywords
- event
- image
- frame
- camera
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Abstract
The invention discloses a synthetic aperture imaging method that fuses an event camera and a conventional optical camera. The method combines the advantages of the two cameras for synthetic aperture imaging: by building a hybrid neural network architecture based on a spiking neural network and a convolutional neural network, it bridges the gap between the event stream and image frame modalities, reconstructs high-quality occlusion-free target image frames, and accomplishes high-quality see-through imaging in scenes with occlusions of various densities. By comprehensively exploiting the advantages of the event camera and the conventional optical camera together with the strong learning capability of neural networks, the invention achieves high-quality image reconstruction of targets in a variety of densely occluded scenes, further enhancing the robustness and applicability of synthetic aperture imaging technology.
Description
Technical Field
The invention belongs to the field of image processing, and particularly relates to a synthetic aperture imaging method that fuses an event camera and a conventional optical camera.
Background
Synthetic aperture imaging (SAI) technology uses a camera to observe a scene from multiple viewpoints, making it equivalent to a virtual camera with a large aperture. The larger a camera's aperture, the smaller its depth of field; therefore, when the target is occluded, synthetic aperture imaging can image the occluded target by blurring out the occluder. The method thus has great application value in fields such as three-dimensional reconstruction, target tracking, and recognition.
Current synthetic aperture imaging algorithms mainly use multi-view image frame sequences captured by conventional optical cameras for see-through imaging. However, as the occlusion grows denser, the target light information contained in the image frames drops sharply while interfering light from the occluder becomes dominant, severely degrading the performance of synthetic aperture imaging algorithms based on conventional optical cameras. In recent years, synthetic aperture imaging algorithms based on event cameras have been proposed to solve the see-through imaging problem in densely occluded scenes. An event camera senses per-pixel brightness changes in the logarithmic domain and asynchronously outputs event stream data with high temporal resolution and high dynamic range; it can sense a target almost continuously and thus obtain sufficient target information in densely occluded scenes. However, because existing event camera synthetic aperture imaging algorithms form images from the event points generated by the brightness difference between the target and the occluder, their performance degrades in sparsely occluded scenes as the number of effective event points shrinks. Considering that synthetic aperture imaging algorithms based on the conventional optical camera and on the event camera perform well in sparsely and densely occluded scenes respectively, the information from the two modalities can be fully exploited for fused imaging, realizing synthetic aperture imaging across occlusion densities. However, since the data modality of the event stream is completely different from that of image frame data, building a bridge between the two for fused imaging remains very difficult.
Disclosure of Invention
To address these problems, the invention provides a synthetic aperture imaging method based on an event camera and a conventional optical camera. It combines the advantages of the two cameras for synthetic aperture imaging: by building a hybrid neural network architecture based on a spiking neural network and a convolutional neural network, it bridges the event stream and image frame modalities and reconstructs high-quality occlusion-free target image frames, accomplishing high-quality see-through imaging in scenes with occlusions of various densities.
The invention provides a synthetic aperture imaging method based on an event camera and a traditional optical camera, which comprises the following specific steps:
step 1, constructing a multi-view event stream and an image frame data set in an occluded scene, and constructing an image frame data set in an unoccluded scene.
Step 2, selecting a reference camera position, mapping the multi-view event stream and the image frame to the reference camera position according to a multi-view geometric principle, and refocusing the shielded target to obtain a refocused event stream and image frame data set;
Step 3, constructing a hybrid neural network model; accumulating the refocused event stream into event frames; applying pre-reconstruction processing to the unfocused event stream to obtain pre-reconstructed event frames and then refocusing them; inputting the refocused event frames, pre-reconstructed event frames, and refocused image frame data set into the hybrid neural network as a training set to obtain the network's predicted target reconstruction image; and, combining the target's occlusion-free image with the network learning loss function, iteratively optimizing the network parameters with the Adam (adaptive moment estimation) optimizer to obtain a trained hybrid neural network;
The hybrid neural network model in step 3 comprises the following modules: a multi-modal encoding module, a cross-attention enhancement module, a density perception module, and a multi-modal decoding module;
the multi-modal encoding module comprises several convolutional or spiking layers and is used for feature extraction;
the cross-attention enhancement module comprises several cross-attention Transformer modules and is used to repeatedly enhance the extracted features;
the density perception module concatenates the enhanced features with the original features, then performs a weighted summation of the features through a convolutional layer, a global average pooling layer, a global max pooling layer, and a fully connected layer, and outputs the fused features;
the multi-modal decoding module is an existing convolutional neural network architecture and reconstructs the image from the fused features;
Step 4, inputting the event stream and image frames of the occluded target to be reconstructed into the trained network for prediction to obtain the de-occluded reconstructed image corresponding to the occluded target.
Further, the multi-view event stream data set E in step 1 is:

E = {e_k = (x_k, y_k, p_k, t_k)}, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, −1}

where e_k is the k-th event point, (x_k, y_k) are its pixel coordinates, p_k is the event polarity (polarity 1 indicates a brightness increase, polarity −1 a decrease), and t_k is its timestamp; W and H are the width and height of the event coordinate space, and K is the number of event points.
Further, the multi-view image frame data set F in step 1 is:

F = {I_n}, n ∈ [1, N]

where I_n is the n-th image frame, W and H are the width and height of the image frames, and N is the total number of image frames.

Further, the manner of mapping the multi-view event stream and image frames to the reference camera position in step 2 is specifically as follows:
[x_r, y_r, 1]^T = K R K^{-1} [x, y, 1]^T + K T / d   (1)
In formula (1), (x_r, y_r) are the mapped image coordinates, (x, y) are the original image coordinates, R and T are the rotation and translation matrices from the pixel point's camera position to the reference camera position, K is the camera intrinsic matrix, and d is the refocusing distance, generally set to the distance from the occluded target to the camera plane. Applying this mapping yields the refocused event stream data set E_r:

E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]

where e_k^r is the k-th refocused event point, (x_k^r, y_k^r) are its x- and y-axis coordinates, p_k^r is its event polarity (polarity 1 indicates a brightness increase, polarity −1 a decrease), and t_k^r is its timestamp. Likewise, applying mapping formula (1) yields the refocused image frame data set F_r = {I_n^r}, n ∈ [1, N].
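As an illustrative sketch (not part of the patent text), the planar refocusing mapping of formula (1) can be applied to a batch of pixel coordinates as follows; the function name and the normalization by the third homogeneous coordinate are assumptions:

```python
import numpy as np

def refocus_points(xy, K, R, T, d):
    """Map pixel coordinates toward the reference camera via the planar
    mapping [x_r, y_r, 1]^T = K R K^{-1} [x, y, 1]^T + K T / d.

    xy : (N, 2) array of pixel coordinates
    K  : (3, 3) camera intrinsic matrix
    R  : (3, 3) rotation to the reference view
    T  : (3,)  translation to the reference view
    d  : refocusing distance (distance of the occluded target)
    """
    homo = np.hstack([xy, np.ones((xy.shape[0], 1))])           # homogeneous coords
    mapped = homo @ (K @ R @ np.linalg.inv(K)).T + (K @ T) / d  # formula (1)
    mapped /= mapped[:, 2:3]                                    # normalize scale
    return mapped[:, :2]
```

With identity rotation and zero translation the mapping is the identity, which makes the sketch easy to sanity-check.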
Further, the frame accumulation of the refocused event stream described in step 3 is expressed as:

I_j^E(x, y) = Σ_k p_k^r · δ(x − x_k^r, y − y_k^r), t_k^r ∈ [t_1 + (j − 1)Δt, t_1 + jΔt)   (2)

In formula (2), I_j^E is the j-th accumulated event frame, J is the total number of event frames, (x, y) are image coordinates, (x_k^r, y_k^r) are the image coordinates of the k-th refocused event point, δ(·) is the Dirac delta function, and Δt is the time span covered by a single event frame, computed as:

Δt = (t_K − t_1) / J

where t_k is the timestamp of the k-th event point and t_1 is the timestamp of the first event point. This yields the accumulated refocused event frame data set F_{E,r} = {I_j^E}, j ∈ [1, J].
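The accumulation step can be sketched in NumPy as follows; this is a minimal illustration assuming polarities are summed per pixel inside each of J uniform time bins, with all names hypothetical:

```python
import numpy as np

def events_to_frames(x, y, p, t, H, W, J):
    """Accumulate a refocused event stream into J event frames:
    each event's polarity (+1/-1) is added at its pixel within its
    time bin of width dt = (t_K - t_1) / J."""
    t = np.asarray(t, dtype=float)
    dt = (t[-1] - t[0]) / J
    frames = np.zeros((J, H, W))
    # bin index of each event; clip the final timestamp into the last bin
    j = np.minimum(((t - t[0]) / dt).astype(int), J - 1)
    np.add.at(frames, (j, y, x), p)  # unbuffered accumulation per (bin, y, x)
    return frames
```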
Further, the event stream pre-reconstruction process described in step 3 is expressed as:

I_j^{E→F}(x, y) = Recon(Σ_k p_k · δ(x − x_k, y − y_k)), t_k ∈ [t_1 + (j − 1)Δt, t_1 + jΔt)   (3)

In formula (3), I_j^{E→F} is the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, (x, y) are image coordinates, (x_k, y_k) are the image coordinates of the k-th event point, δ(·) is the Dirac delta function, Recon(·) is the event stream luminance reconstruction operator, for which a current mainstream event stream reconstruction algorithm is generally used, and Δt is the time span covered by a single event frame. Next, each image I_j^{E→F} is mapped toward the reference camera position using mapping formula (1), giving the refocused pre-reconstructed event frame data set F_{E→F,r} = {I_j^{E→F,r}}, j ∈ [1, J], where I_j^{E→F,r} is the j-th refocused pre-reconstructed event frame.
Further, the hybrid neural network model in step 3 includes the following modules: a multi-modal encoding module, a cross-attention enhancement module, a density perception module, and a multi-modal decoding module. The network takes as input the accumulated refocused event frames F_{E,r}, the refocused image frames F_r, and the refocused pre-reconstructed event frames F_{E→F,r}. The three branches are first input into the multi-modal encoding module for coarse feature extraction:

f_{F,0}, f_{E,0}, f_{E→F,0} = MF-Encoder(F_r, F_{E,r}, F_{E→F,r})

where f_{F,0}, f_{E,0}, f_{E→F,0} are the coarse features extracted from F_r, F_{E,r}, F_{E→F,r} respectively, and MF-Encoder(·) is the multi-modal encoding operation, which extracts features from each of the three signals. The F_r and F_{E→F,r} branches use a three-layer convolutional structure with skip connections for feature extraction; the F_{E,r} branch uses three spiking layers with skip connections.
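The patent does not detail the internals of its spiking layers; as a hedged illustration of the kind of computation such a layer performs, below is a toy leaky integrate-and-fire (LIF) layer in NumPy. The threshold, decay, and hard-reset behavior are assumptions for illustration, not the patent's exact design:

```python
import numpy as np

def lif_layer(inputs, v_th=1.0, decay=0.5):
    """Toy leaky integrate-and-fire layer over T time steps.
    inputs: (T, N) input currents; returns (T, N) binary spike trains."""
    v = np.zeros(inputs.shape[1])
    spikes = np.zeros_like(inputs)
    for step, x in enumerate(inputs):
        v = decay * v + x                    # leaky membrane integration
        s = (v >= v_th).astype(float)        # fire where threshold is crossed
        spikes[step] = s
        v = v * (1.0 - s)                    # hard reset after a spike
    return spikes
```

Such layers consume the event branch natively in time, which is why the patent pairs them with the asynchronous event stream rather than with image frames.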
Next, the features are enhanced using a cross-attention enhancement module:
f_{F,M}, f_{E,M}, f_{E→F,M} = CME(f_{F,0}, f_{E,0}, f_{E→F,0})

where f_{F,0}, f_{E,0}, f_{E→F,0} are the three branch features output by the multi-modal encoding module in the previous step, f_{F,M}, f_{E,M}, f_{E→F,M} are the three branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, in which M cross-attention Transformer modules enhance the features repeatedly; the process is expressed as:

f_{F,m}, f_{E,m}, f_{E→F,m} = C-Transformer(f_{F,m−1}, f_{E,m−1}, f_{E→F,m−1}), m ∈ [1, M]

where f_{F,m−1}, f_{E,m−1}, f_{E→F,m−1} are the three branch features input to the m-th cross-attention Transformer module, f_{F,m}, f_{E,m}, f_{E→F,m} are the output features, and C-Transformer(·) is the cross-attention Transformer feature enhancement operation: each signal is first normalized and its attention information computed; the attention information of the three branches is fused by addition to achieve cross-modal information enhancement; a multilayer perceptron then performs feature enhancement and outputs the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m}.
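The add-and-fuse step can be sketched in NumPy as below. This is a simplified single-head version with one shared projection set for all three branches; a real implementation would use per-branch learned projections and a per-branch MLP, so all names and simplifications here are assumptions:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(f, Wq, Wk, Wv):
    # single-head scaled dot-product self-attention on (tokens, dim) features
    q, k, v = f @ Wq, f @ Wk, f @ Wv
    a = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return a @ v

def cross_modal_enhance(f_F, f_E, f_EF, W):
    """Normalize each branch, compute its attention output, fuse the three
    attention outputs by addition, and add the fused signal back residually."""
    feats = [f_F, f_E, f_EF]
    atts = []
    for f in feats:
        fn = (f - f.mean(-1, keepdims=True)) / (f.std(-1, keepdims=True) + 1e-6)
        atts.append(attention(fn, *W))
    fused = sum(atts)                 # the add-and-fuse cross-modal step
    return [f + fused for f in feats]  # residual; an MLP would follow per branch
```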
Next, performing weighted fusion on the three-way features by using a density perception module:
f_ALL = DAF(f_{F,M}, f_{E,M}, f_{E→F,M}, F_r, F_{E,r}, F_{E→F,r})

where f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced branch features output in the previous step, F_r, F_{E,r}, F_{E→F,r} are the three original network inputs, f_ALL is the fused output feature, and DAF(·) is the density-aware fusion operation: each input feature is first concatenated with its corresponding original input along the feature dimension and passed through a single convolutional layer for feature extraction; global average pooling and global max pooling then extract the mean and maximum of each of the three branch features, which are input to a fully connected layer to compute weights; finally the three branches are weighted and summed, outputting the fused feature f_ALL.
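A minimal sketch of the density-aware weighting idea follows, with the convolutional layers and the learned fully connected layer replaced by a simple sum over pooled statistics (an assumption for illustration only):

```python
import numpy as np

def density_aware_fusion(feats):
    """Weight and sum three feature maps of shape (C, H, W) using their
    global average- and max-pooled statistics, a stand-in for the
    density perception module (conv and FC layers omitted)."""
    stats = []
    for f in feats:
        # per-channel global average and max pooling, concatenated
        stats.append(np.concatenate([f.mean(axis=(1, 2)), f.max(axis=(1, 2))]))
    # branch scores from the pooled statistics (FC layer replaced by a sum),
    # turned into weights with a softmax
    scores = np.array([s.sum() for s in stats])
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return sum(wi * fi for wi, fi in zip(w, feats))
```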
Finally, decoding the features by using a multi-mode decoding module, and outputting a final brightness reconstruction image:
I_recon = MF-Decode(f_ALL)

where f_ALL is the fused feature output by the previous module, I_recon is the reconstructed luminance image, and MF-Decode(·) is the multi-modal feature decoding operation, which generally uses a mainstream convolutional neural network architecture.
The network learning loss function in step 3 is defined as:

L = β_per · L_per(I_recon, I_gt) + β_L1 · ||I_recon − I_gt||_1 + β_tv · L_tv(I_recon)   (4)

In formula (4), I_recon is the luminance image reconstructed by the network, I_gt is the ground-truth image corresponding to the target image in the data set, L_per is the perceptual loss, ||·||_1 is the L1-norm loss, and L_tv is the total variation loss, all of which are commonly used known functions; β_per, β_L1, β_tv are the weights of the corresponding losses.
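The composite loss can be sketched as follows; feat_fn stands in for the feature extractor of the perceptual loss (in practice, e.g., a pretrained VGG network), and the default weights follow the values given later in the embodiment:

```python
import numpy as np

def l1_loss(x, y):
    return np.abs(x - y).mean()

def tv_loss(x):
    # anisotropic total variation of an image (H, W)
    return np.abs(np.diff(x, axis=0)).mean() + np.abs(np.diff(x, axis=1)).mean()

def network_loss(recon, gt, feat_fn, b_per=1.0, b_l1=32.0, b_tv=2e-4):
    """Weighted sum of perceptual, L1, and total-variation terms.
    feat_fn is a placeholder feature extractor for the perceptual term."""
    l_per = l1_loss(feat_fn(recon), feat_fn(gt))
    return b_per * l_per + b_l1 * l1_loss(recon, gt) + b_tv * tv_loss(recon)
```

The L1 term anchors pixel intensities, the perceptual term preserves structure, and the small TV term suppresses event-induced noise without over-smoothing.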
Further, the event stream and image frame data input in step 4 must undergo the same refocusing process as in step 2 and the same event stream preprocessing as in step 3 before being input into the trained neural network to obtain the corresponding target reconstruction image.
Compared with the prior art, the invention has the advantages that:
the invention provides a synthetic aperture imaging method fusing an event camera and a traditional optical camera, which comprehensively utilizes the advantages of the event camera and the traditional optical camera and uses the strong learning capability of a neural network, thereby realizing the high-quality image reconstruction of a target in various dense occlusion scenes and further enhancing the robustness and the applicability of the synthetic aperture imaging technology.
Drawings
FIG. 1 is a schematic view of the experimental scene, which includes an occluded target, an occluder, a programmable slide rail, and a camera sensor.
FIG. 2 is an overall flow chart of the method of the present invention.
Fig. 3 is a schematic diagram of a process of acquiring data by moving a camera.
Fig. 4 is a comparison graph of image frames, event frames and pre-reconstruction event frames after data preprocessing.
Fig. 5 is a schematic diagram of a hybrid neural network structure.
Fig. 6 is a comparison of the present method with different synthetic aperture imaging methods. The first to fourth rows from top to bottom show synthetic aperture imaging results for occlusion scenes of different densities. From left to right: the first column is a schematic view of the occluder, the second column is the occlusion-free reference image, the third column is a synthetic aperture imaging algorithm based on a conventional optical camera (F-SAI+ACC), the fourth column is a synthetic aperture imaging algorithm based on a conventional optical camera and a convolutional neural network (F-SAI+CNN), the fifth column is a synthetic aperture imaging algorithm based on an event camera and accumulation (E-SAI+ACC), the sixth column is a synthetic aperture imaging algorithm based on an event camera and a hybrid neural network (E-SAI+Hybrid), and the seventh column is the algorithm of the invention.
Detailed Description
For a clear understanding of the present invention, its technical content is described more clearly and completely below with reference to FIG. 1 and an example. Obviously, the described example is only part of the examples of the invention, not all of them. All other examples obtained by a person skilled in the art from the examples of the present invention without creative effort fall within the protection scope of the invention.
The invention relates to a synthetic aperture imaging method for fusing an event camera and a traditional optical camera.
The schematic scenario of a specific implementation of the invention is shown in FIG. 1 and includes: a camera sensor, a programmable slide rail, an occluder, and an occluded target;
the camera sensor is a Davis346 event camera, and the camera can synchronously output event streams and image frame data and is used for constructing a corresponding data set;
the Davis346 event camera is mounted on a programmable slide rail that is set to move linearly at a constant speed;
the overall flow chart of the invention is shown in the attached figure 2, and the specific steps are as follows:
step 1, constructing a multi-view event stream and an image frame data set in an occluded scene, and constructing an image frame data set in an unoccluded scene, as shown in fig. 3;
the multi-view event stream data E in step 1 is
E={e (k}) =(x k ,y k ,p k ,t k )},k∈[1,K],x k ∈[1,W],y k ∈[1,H],p k ∈[1,-1]
Wherein e is k For the kth event point data, x k ,y k Pixel coordinates of a kth event point; p is a radical of k Event polarity (polarity 1 represents increased brightness and polarity-1 represents decreased brightness); t is t k A timestamp of the event point; w346, H260 each indicate the width and height of the event point space coordinate, and K indicates the number of event points.
The multi-view image frame data F in step 1 is:
wherein, I n The nth image frame data is shown, W is346, H is 260 represents the width and height of the image frame, respectively, and N is 30 represents the total number of image frames.
Step 2, selecting a reference camera position, mapping the multi-view event stream and the image frame to the reference camera position according to a multi-view geometric principle, and refocusing the shielded target to obtain a refocused event stream and image frame data set;
The manner of mapping the multi-view event stream and image frames to the reference camera position described in step 2 is specifically as follows:

[x_r, y_r, 1]^T = K R K^{-1} [x, y, 1]^T + K T / d   (1)

In formula (1), (x_r, y_r) are the mapped image coordinates, R and T are the rotation and translation matrices from the pixel point's camera position to the reference camera position, K is the camera intrinsic matrix, and d is the focusing distance, generally set to the distance from the occluded target to the camera plane. In this example, since the camera is set to move in a uniform straight line, the camera can be assumed not to rotate during capture, so the rotation matrix is the identity matrix. At time t, the translation matrix can be modeled as

T_t = [v_track (t − t_r), 0, 0]

where v_track = 0.0885 m/s is the moving speed of the slide rail and t_r is the timestamp at which the camera passes the reference position in each capture.

Applying mapping formula (1) yields the refocused event stream data E_r:

E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]

where e_k^r is the k-th refocused event point, (x_k^r, y_k^r) are its x- and y-axis coordinates, p_k^r is its event polarity (polarity 1 indicates a brightness increase, polarity −1 a decrease), and t_k^r is its timestamp. Likewise, applying mapping formula (1) yields the refocused image frame data set F_r = {I_n^r}, n ∈ [1, N], where I_n^r is the n-th refocused image frame.
Step 3, constructing a hybrid neural network model; accumulating the refocused event stream into event frames; applying pre-reconstruction processing to the unfocused event stream to obtain pre-reconstructed event frames and then refocusing them; inputting the refocused event frames, pre-reconstructed event frames, and refocused image frame data set into the hybrid neural network as a training set to obtain the network-predicted target reconstruction image; and, combining the target's occlusion-free image with the network learning loss function, iteratively optimizing the network parameters with the Adam (adaptive moment estimation) optimizer to obtain a trained hybrid neural network;
The frame accumulation of the refocused event stream described in step 3 is expressed as:

I_j^E(x, y) = Σ_k p_k^r · δ(x − x_k^r, y − y_k^r), t_k^r ∈ [t_1 + (j − 1)Δt, t_1 + jΔt)   (2)

In formula (2), I_j^E is the j-th accumulated event frame, J is the total number of event frames, (x, y) are image coordinates, (x_k^r, y_k^r) are the image coordinates of the k-th refocused event point, δ(·) is the Dirac delta function, and Δt is the time span covered by a single event frame, computed as:

Δt = (t_K − t_1) / J

where t_k is the timestamp of the k-th event point and t_1 is the timestamp of the first event point. This yields the accumulated refocused event frame data set F_{E,r} = {I_j^E}, j ∈ [1, J].
The event stream pre-reconstruction and refocusing process described in step 3 is expressed as:

I_j^{E→F}(x, y) = Recon(Σ_k p_k · δ(x − x_k, y − y_k)), t_k ∈ [t_1 + (j − 1)Δt, t_1 + jΔt)   (3)

In formula (3), I_j^{E→F} is the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, (x, y) are image coordinates, (x_k, y_k) are the image coordinates of the k-th event point, δ(·) is the Dirac delta function, Recon(·) is the event stream luminance reconstruction operator (in this example the existing E2VID algorithm is used), and Δt is the time span covered by a single event frame. Next, each image I_j^{E→F} is mapped toward the reference camera position using mapping formula (1), giving the refocused pre-reconstructed event frame data set F_{E→F,r} = {I_j^{E→F,r}}, j ∈ [1, J], where I_j^{E→F,r} is the j-th refocused pre-reconstructed event frame.
Figure 4 shows visually the refocused image frame, the event frame and the pre-reconstructed event frame, respectively.
The structure of the hybrid neural network described in step 3 is shown in FIG. 5. The hybrid neural network model in step 3 comprises the following modules: a multi-modal encoding module, a cross-attention enhancement module, a density perception module, and a multi-modal decoding module. The network takes as input the accumulated refocused event frames F_{E,r}, the refocused image frames F_r, and the refocused pre-reconstructed event frames F_{E→F,r}. The three branches are input into the multi-modal encoding module for coarse feature extraction:

f_{F,0}, f_{E,0}, f_{E→F,0} = MF-Encoder(F_r, F_{E,r}, F_{E→F,r})

where f_{F,0}, f_{E,0}, f_{E→F,0} are the coarse features extracted from F_r, F_{E,r}, F_{E→F,r} respectively, and MF-Encoder(·) is the multi-modal encoding operation, which extracts features from each of the three signals. The F_r and F_{E→F,r} branches use a three-layer convolutional structure with skip connections; in this embodiment, the three convolutional layers have kernel sizes 3, 5, and 7 with 8, 16, and 32 kernels respectively. The F_{E,r} branch uses three spiking layers with skip connections; in this embodiment, the three spiking layers have kernel sizes 1, 3, and 7 with 8, 16, and 32 kernels respectively.
Next, the features are enhanced using a cross-attention enhancement module:
f_{F,M}, f_{E,M}, f_{E→F,M} = CME(f_{F,0}, f_{E,0}, f_{E→F,0})

where f_{F,0}, f_{E,0}, f_{E→F,0} are the three branch features output by the multi-modal encoding module in the previous step, f_{F,M}, f_{E,M}, f_{E→F,M} are the three branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which uses M = 6 cross-attention Transformer modules to enhance the features repeatedly:

f_{F,m}, f_{E,m}, f_{E→F,m} = C-Transformer(f_{F,m−1}, f_{E,m−1}, f_{E→F,m−1}), m ∈ [1, M]

where f_{F,m−1}, f_{E,m−1}, f_{E→F,m−1} are the three branch features input to the m-th cross-attention Transformer module, f_{F,m}, f_{E,m}, f_{E→F,m} are the output features, and C-Transformer(·) is the cross-attention Transformer feature enhancement operation: each signal is first normalized and its attention information computed; the attention information of the three branches is fused by addition to achieve cross-modal information enhancement; a multilayer perceptron then performs feature enhancement and outputs the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m}. In this embodiment, the Transformer module used is the conventional Swin-Transformer architecture.
Next, performing weighted fusion on the three-way features by using a density perception module:
f_ALL = DAF(f_{F,M}, f_{E,M}, f_{E→F,M}, F_r, F_{E,r}, F_{E→F,r})

where f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced branch features output in the previous step, F_r, F_{E,r}, F_{E→F,r} are the three original network inputs, f_ALL is the fused output feature, and DAF(·) is the density-aware fusion operation: each input feature is first concatenated with its corresponding original input along the feature dimension and passed through a single convolutional layer for feature extraction (in this embodiment, the convolution kernel size is 3 and the number of kernels is 32); global average pooling and global max pooling then extract the mean and maximum of each of the three branch features, which are input to a fully connected layer to compute weights; finally the three branches are weighted and summed, outputting the fused feature f_ALL.
Finally, the multi-modal decoding module decodes the features and outputs the final luminance reconstruction image:
I_recon = MF-Decode(f_ALL)
wherein f_ALL is the fused feature output by the previous module and I_recon is the reconstructed luminance image. MF-Decode(·) is the multi-modal feature decoding operation; in this embodiment, the existing ResNet architecture serves as the multi-modal feature decoder.
The network learning loss function in step 3 is defined as:
L_total = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)   (4)

In formula (4), I_recon is the luminance image reconstructed by the network, I_gt is the ground-truth image corresponding to the target image in the data set, L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss, all of which are commonly used functions; β_per, β_L1, β_tv are respectively the weights of the corresponding losses, set in this example to β_per = 1, β_L1 = 32, β_tv = 0.0002.
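A NumPy sketch of this weighted loss under the stated weights; the pretrained feature extractor behind the perceptual term is replaced by a pluggable feat_fn (identity by default), which is an assumption for illustration, not the network the patent uses:

```python
import numpy as np

def l1_loss(pred, gt):
    # L1-norm loss: mean absolute pixel difference.
    return np.abs(pred - gt).mean()

def tv_loss(img):
    # Anisotropic total variation: mean absolute difference of neighbours.
    dx = np.abs(img[:, 1:] - img[:, :-1]).mean()
    dy = np.abs(img[1:, :] - img[:-1, :]).mean()
    return dx + dy

def total_loss(pred, gt, feat_fn=lambda x: x,
               b_per=1.0, b_l1=32.0, b_tv=2e-4):
    # feat_fn stands in for a pretrained feature extractor; with the
    # identity default the "perceptual" term degenerates to an L2 loss.
    l_per = ((feat_fn(pred) - feat_fn(gt)) ** 2).mean()
    return b_per * l_per + b_l1 * l1_loss(pred, gt) + b_tv * tv_loss(pred)

rng = np.random.default_rng(0)
gt = rng.random((64, 64))
noisy = np.clip(gt + 0.05 * rng.standard_normal((64, 64)), 0.0, 1.0)
```

With a perfect reconstruction only the total-variation regularizer contributes, so the loss rises monotonically as the prediction drifts from the ground truth.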
And 4, inputting the event stream of the occluded target to be reconstructed and the image frame into the trained network for prediction to obtain a de-occluded reconstructed image corresponding to the occluded target.
The event stream and image frame data input in step 4 must undergo the same refocusing process as in step 2 and the same event stream preprocessing as in step 3 before being input into the trained neural network to obtain the corresponding target reconstruction image.
Figure 6 compares the results of the algorithm of the present invention with other synthetic aperture imaging algorithms in a variety of occluded scenes. The first through fourth rows, from top to bottom, show synthetic aperture imaging results for different densely occluded scenes. From left to right: the first column is a schematic diagram of the occlusion; the second column is the unoccluded reference image; the third column is a synthetic aperture imaging algorithm based on a traditional optical camera (F-SAI+ACC); the fourth column is a synthetic aperture imaging algorithm based on a traditional optical camera and a convolutional neural network (F-SAI+CNN); the fifth column is a synthetic aperture imaging algorithm based on an event camera and an accumulation method (E-SAI+ACC); the sixth column is a synthetic aperture imaging algorithm based on an event camera and a hybrid neural network (E-SAI+Hybrid); and the seventh column is the algorithm of the present invention.
It should be understood that technical parts not described in detail in the specification belong to the prior art.
It should be understood that the above-described preferred embodiments are illustrative and not restrictive of the scope of the invention, and that various changes, substitutions and alterations can be made herein by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (10)
1. A method for synthetic aperture imaging combining an event camera with a conventional optical camera, comprising the steps of:
step 1, constructing a multi-view event stream and an image frame data set in an occluded scene, and constructing an image frame data set in an unoccluded scene;
step 2, selecting a reference camera position, mapping the multi-view event stream and the image frame to the reference camera position according to a multi-view geometric principle, and refocusing the shielded target to obtain a refocused event stream and image frame data set;
step 3, constructing a hybrid neural network model, performing frame compression on the refocused event stream to obtain event frames, performing pre-reconstruction processing on the unfocused event stream to obtain pre-reconstructed event frames and refocusing them, inputting the pre-reconstructed event frames and the refocused image frame data set into the hybrid neural network model as a training set to obtain a network-predicted target reconstruction image, and iteratively optimizing the network parameters through an ADAM (adaptive moment estimation) optimizer by combining the unoccluded image of the target with the network learning loss function, to obtain a trained hybrid neural network model;
the hybrid neural network model in step 3 comprises the following modules: the system comprises a multi-mode coding module, a cross-attention enhancement module, a density perception module and a multi-mode decoding module;
the multi-mode coding module comprises a plurality of convolution layers or pulse layers and is used for extracting features;
the cross-attention enhancement module comprises a plurality of cross-attention Transformer modules and is used for enhancing the extracted features for a plurality of times;
the density sensing module is used for performing feature cascade on the enhanced features and the original features, then performing weighted summation on the features through a convolutional layer, a global average pooling layer, a global maximum pooling layer and a full connection layer, and outputting the fused features;
the multi-mode decoding module is an existing convolutional neural network architecture and is used for reconstructing an image according to the fused features;
and 4, inputting the event stream of the occluded target to be reconstructed and the image frame into the trained mixed neural network model for prediction to obtain a de-occluded reconstructed image corresponding to the occluded target.
2. The method of claim 1, wherein the method comprises the steps of: the multi-view event stream data set E described in step 1 is represented as:

E = {e_k = (x_k, y_k, p_k, t_k)}, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, −1}

wherein e_k is the k-th event point, x_k, y_k are its pixel coordinates, p_k is the event polarity (a polarity of 1 indicates an illumination increase and −1 a decrease), and t_k is the timestamp of the event point; W and H respectively represent the width and height of the event spatial coordinates, and K represents the number of event points.
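One convenient in-memory form for such a data set E is a structured NumPy array with one record per event point e_k = (x_k, y_k, p_k, t_k); the dtype and the sensor size below are illustrative choices, not mandated by the method:

```python
import numpy as np

# One record per event point e_k = (x_k, y_k, p_k, t_k).
event_dtype = np.dtype([('x', 'u2'), ('y', 'u2'), ('p', 'i1'), ('t', 'f8')])

rng = np.random.default_rng(1)
W, H, K = 346, 260, 1000          # hypothetical sensor resolution and event count
events = np.zeros(K, dtype=event_dtype)
events['x'] = rng.integers(0, W, K)
events['y'] = rng.integers(0, H, K)
events['p'] = rng.choice([-1, 1], K)              # polarity: +1 brighter, -1 darker
events['t'] = np.sort(rng.uniform(0.0, 1.0, K))   # timestamps kept in order
```

Keeping the timestamps sorted matches how an event camera emits its stream and simplifies the later windowed frame compression.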
3. The method of claim 1, wherein the method comprises the steps of: the multi-view image frame data set F is represented as:

F = {I_n}, n ∈ [1, N]

wherein I_n denotes the n-th image frame of width W and height H, and N represents the total number of image frames.
4. The method of claim 1, wherein the method comprises the steps of: the manner of mapping the multi-view event stream image frame to the reference camera position described in step 2 is specifically as follows:
[x_r, y_r, 1] = K R K^(−1) [x, y, 1] + K T / d   (1)
in formula (1), x_r, y_r represent the mapped image coordinates; R and T represent the rotation and translation matrices from the pixel's camera position to the reference camera position; x, y represent the image coordinates; K represents the intrinsic matrix of the camera; and d is the focusing distance;
according to mapping formula (1), a refocused event stream data set E_r can be obtained:

E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]

wherein e_k^r is the k-th refocused event point, x_k^r, y_k^r represent its x- and y-axis coordinates, p_k^r is the event polarity (a polarity of 1 indicates an illumination increase and −1 a decrease), and t_k^r is the timestamp of the event point; likewise, according to mapping formula (1), the refocused image frame data set F_r can be obtained:

F_r = {I_n^r}, n ∈ [1, N]
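Formula (1) can be applied per pixel as below; a dehomogenisation step is added, which the formula leaves implicit, and the intrinsics/extrinsics shown are made-up values for illustration only:

```python
import numpy as np

def refocus_point(x, y, K, R, T, d):
    """Map pixel (x, y) toward the reference view per formula (1):
    [x_r, y_r, 1] = K R K^{-1} [x, y, 1] + K T / d."""
    p = np.array([x, y, 1.0])
    q = K @ R @ np.linalg.inv(K) @ p + (K @ T) / d
    return q[0] / q[2], q[1] / q[2]   # dehomogenise

# Illustrative camera: focal length 500 px, principal point (160, 120).
K = np.array([[500.0, 0.0, 160.0],
              [0.0, 500.0, 120.0],
              [0.0,   0.0,   1.0]])
R = np.eye(3)                   # no rotation relative to the reference camera
T = np.array([0.1, 0.0, 0.0])   # hypothetical 0.1 m baseline along x
x_r, y_r = refocus_point(100.0, 80.0, K, R, T, d=2.0)
```

With a pure translation, the K·T/d term is the familiar disparity shift: points at the focusing distance d land on the same reference-view pixel, which is what brings the occluded target into focus.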
5. The method of claim 4, wherein the event camera and the conventional optical camera are integrated into a synthetic aperture imaging method, and the method comprises the following steps: the frame-compression process of the refocused event stream described in step 3 is represented as:

F_{E,r}^j(x, y) = Σ_k p_k^r · δ(x − x_k^r) · δ(y − y_k^r), t_k^r ∈ [t_1 + (j−1)Δt, t_1 + jΔt]   (2)

In formula (2), F_{E,r}^j denotes the j-th compressed event frame, J is the total number of event frames, x, y here represent image coordinates, x_k^r, y_k^r are the image coordinates of the k-th refocused event point, δ(·) denotes the Dirichlet (unit impulse) function, and Δt denotes the time span used to compress a single event frame, computed as:

Δt = (t_K − t_1) / J

wherein t_k is the timestamp of the k-th event point and t_1 is the timestamp of the first event point; the framed refocused event frame data set can thereby be obtained:

F_{E,r} = {F_{E,r}^j}, j ∈ [1, J]
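Formula (2) together with the Δt rule amounts to a scatter-add of event polarities into J time windows; a NumPy sketch under that reading (names illustrative):

```python
import numpy as np

def events_to_frames(x, y, p, t, J, H, W):
    """Compress a (refocused) event stream into J event frames: window j
    accumulates the polarities of events whose timestamps fall inside
    [t_1 + (j-1)*dt, t_1 + j*dt], where dt = (t_K - t_1) / J."""
    frames = np.zeros((J, H, W))
    dt = (t[-1] - t[0]) / J
    # Window index per event; clip so t_K lands in the last frame.
    j = np.minimum(((t - t[0]) / dt).astype(int), J - 1)
    np.add.at(frames, (j, y, x), p)   # unbuffered scatter-add of polarities
    return frames

x = np.array([0, 1, 1])
y = np.array([0, 0, 0])
p = np.array([1, -1, -1])
t = np.array([0.0, 0.9, 1.0])
frames = events_to_frames(x, y, p, t, J=2, H=2, W=2)
```

`np.add.at` is used instead of plain fancy-index assignment so that repeated events at the same (frame, pixel) location accumulate rather than overwrite.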
6. The method of claim 5, wherein the method comprises the steps of: the event stream pre-reconstruction and refocusing process described in step 3 is represented as:

F_{E→F}^j = Recon({e_k | t_k ∈ [t_1 + (j−1)Δt, t_1 + jΔt]})   (3)

In formula (3), F_{E→F}^j denotes the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, x, y here represent image coordinates and x_k, y_k those of the k-th event point, δ(·) is the Dirichlet function, Recon(·) denotes the event-stream luminance reconstruction operator, and Δt denotes the time span used to compress a single event frame; next, each pre-reconstructed event frame is mapped toward the reference camera position using mapping formula (1) above, obtaining the refocused pre-reconstructed event frame data set F_{E→F,r}:

F_{E→F,r} = {F_{E→F,r}^j}, j ∈ [1, J]
7. A method of synthetic aperture imaging fusing an event camera with a conventional optical camera according to claim 6, characterized by: the framed refocused event frames F_{E,r}, the refocused image frames F_r, and the refocused pre-reconstructed event frames F_{E→F,r} input to the hybrid neural network model form three branches, which are input to the multi-modal coding module for coarse feature extraction:

f_{F,0}, f_{E,0}, f_{E→F,0} = MF-Encoder(F_r, F_{E,r}, F_{E→F,r})

wherein f_{F,0}, f_{E,0}, f_{E→F,0} respectively represent the coarse features extracted from F_r, F_{E,r}, F_{E→F,r}, and MF-Encoder(·) represents the multi-modal coding operation, which extracts coarse features from the three signals separately: for the two signals F_r and F_{E→F,r}, features are extracted using a three-layer convolutional architecture containing skip connections; for the F_{E,r} signal, features are extracted using three pulse layers containing skip connections;
next, the features are enhanced using a cross-attention enhancement module:
f_{F,M}, f_{E,M}, f_{E→F,M} = CME(f_{F,0}, f_{E,0}, f_{E→F,0})

wherein f_{F,0}, f_{E,0}, f_{E→F,0} are the three-way features output by the multi-modal coding module, f_{F,M}, f_{E,M}, f_{E→F,M} are the three-way features after cross-attention enhancement, and CME(·) is a cross-modal enhancement operation in which a Transformer module containing M cross-attention stages enhances the features multiple times; the process is expressed as:

f_{F,m}, f_{E,m}, f_{E→F,m} = C-Transformer(f_{F,m−1}, f_{E,m−1}, f_{E→F,m−1}), m ∈ [1, M]

wherein f_{F,m−1}, f_{E,m−1}, f_{E→F,m−1} are the three-way features input to the m-th cross-attention Transformer module and f_{F,m}, f_{E,m}, f_{E→F,m} are the output three-way features; C-Transformer(·) denotes the feature enhancement operation of a cross-attention Transformer, which first normalizes each signal and computes its attention information, achieves cross-modal information enhancement by additively fusing the attention information of the three signals, performs feature enhancement using a multilayer perceptron, and outputs the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m};
Next, the density perception module performs weighted fusion of the three-way features:

f_ALL = DAF(f_{F,M}, f_{E,M}, f_{E→F,M}, F_r, F_{E,r}, F_{E→F,r})

wherein f_{F,M}, f_{E,M}, f_{E→F,M} are the three-way enhanced features output by the cross-attention enhancement module, F_r, F_{E,r}, F_{E→F,r} are the three original input signals, and f_ALL is the fused output feature; DAF(·) is the density-aware fusion operation, in which the input features are first concatenated with the corresponding original inputs along the feature dimension and passed through a single convolutional layer for feature extraction, global average pooling and global max pooling then extract the mean and maximum of each of the three features, which are input to a fully connected layer to compute weights, and finally the three signals are weighted and summed to output the fused feature f_ALL;
Finally, the multi-modal decoding module decodes the features and outputs the final luminance reconstruction image:

I_recon = MF-Decode(f_ALL)

wherein f_ALL is the fused feature output by the density sensing module, I_recon is the reconstructed luminance image, and MF-Decode(·) is the multi-modal feature decoding operation.
8. The method of claim 1, wherein the event camera is fused with a conventional optical camera, and the method comprises: the network learning loss function in step 3 is defined as:

L_total = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)   (4)

In formula (4), I_recon is the luminance image reconstructed by the hybrid neural network model, I_gt is the ground-truth image corresponding to the target image in the data set, L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss; β_per, β_L1, β_tv are respectively the weights of the corresponding losses.
9. The method of claim 7, wherein the event camera is fused with a conventional optical camera, and the method comprises: the sizes of convolution kernels of the three layers of convolution layers are respectively 3, 5 and 7, and the number of the convolution kernels is 8, 16 and 32; the pulse core sizes of the three pulse layers are 1, 3 and 7, and the number of the pulse cores is 8, 16 and 32.
10. The method of claim 7, wherein the event camera is fused with a conventional optical camera, and the method comprises: the Transformer module is an existing Swin-Transformer architecture, and the multi-mode feature decoder is an existing ResNet architecture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210422694.2A CN114862732B (en) | 2022-04-21 | 2022-04-21 | Synthetic aperture imaging method integrating event camera and traditional optical camera |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114862732A true CN114862732A (en) | 2022-08-05 |
CN114862732B CN114862732B (en) | 2024-04-26 |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115578295A (en) * | 2022-11-17 | 2023-01-06 | 中国科学技术大学 | Video rain removing method, system, equipment and storage medium |
CN115761472A (en) * | 2023-01-09 | 2023-03-07 | 吉林大学 | Underwater dim light scene reconstruction method based on fusion event and RGB data |
CN116310408A (en) * | 2022-11-29 | 2023-06-23 | 北京大学 | Method and device for establishing data association between event camera and frame camera |
CN117745596A (en) * | 2024-02-19 | 2024-03-22 | 吉林大学 | Cross-modal fusion-based underwater de-blocking method |
CN117939309A (en) * | 2024-03-25 | 2024-04-26 | 荣耀终端有限公司 | Image demosaicing method, electronic device and storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200234414A1 (en) * | 2019-01-23 | 2020-07-23 | Inception Institute of Artificial Intelligence, Ltd. | Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures |
US20200265590A1 (en) * | 2019-02-19 | 2020-08-20 | The Trustees Of The University Of Pennsylvania | Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning |
CN111667442A (en) * | 2020-05-21 | 2020-09-15 | 武汉大学 | High-quality high-frame-rate image reconstruction method based on event camera |
CN112987026A (en) * | 2021-03-05 | 2021-06-18 | 武汉大学 | Event field synthetic aperture imaging algorithm based on hybrid neural network |
Non-Patent Citations (2)
Title |
---|
ZHANG, Xiang et al.: "Event-based Synthetic Aperture Imaging with a Hybrid Network", IEEE, 2 November 2021, pages 14235-14242 |
XIANG, Yiyi et al.: "Synthetic aperture imaging method based on confocal illumination", Acta Optica Sinica, vol. 40, no. 08, 30 April 2020, pages 73-79 |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |