CN114862732B - Synthetic aperture imaging method integrating event camera and traditional optical camera - Google Patents
- Publication number
- CN114862732B CN114862732B CN202210422694.2A CN202210422694A CN114862732B CN 114862732 B CN114862732 B CN 114862732B CN 202210422694 A CN202210422694 A CN 202210422694A CN 114862732 B CN114862732 B CN 114862732B
- Authority
- CN
- China
- Prior art keywords
- event
- image
- camera
- synthetic aperture
- frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T5/00—Image enhancement or restoration
- G06T5/50—Image enhancement or restoration using two or more images, e.g. averaging or subtraction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20212—Image combination
- G06T2207/20221—Image fusion; Image merging
Abstract
The invention provides a synthetic aperture imaging method that fuses an event camera with a traditional optical camera, combining the advantages of the two cameras for synthetic aperture imaging. By constructing a hybrid neural network architecture based on a spiking neural network and a convolutional neural network, the method builds a bridge between the event stream and the image frames and reconstructs a high-quality, occlusion-free image of the target, accomplishing high-quality see-through imaging in occlusion scenes of various densities. The invention comprehensively exploits the complementary advantages of the event camera and the traditional optical camera, together with the strong learning ability of neural networks, thereby achieving high-quality image reconstruction of targets behind occlusions of various densities and enhancing the robustness and applicability of synthetic aperture imaging technology.
Description
Technical Field
The invention belongs to the field of image processing, and particularly relates to a synthetic aperture imaging method that fuses an event camera with a traditional optical camera.
Background
Synthetic aperture imaging (Synthetic Aperture Imaging, SAI) techniques use cameras to observe a scene from multiple viewpoints, which is equivalent to a single virtual camera with a large aperture. Because a larger aperture yields a smaller depth of field, synthetic aperture imaging can blur out an occluder and image the occluded target behind it, which gives the technique high application value in fields such as three-dimensional reconstruction, target tracking, and recognition.
Current synthetic aperture imaging algorithms mainly use a sequence of multi-view image frames captured by a traditional optical camera for see-through imaging. However, as the occlusion becomes denser, the target light information contained in the image frames decreases dramatically and the interfering light from the occluder dominates, causing serious performance degradation of synthetic aperture imaging algorithms based on traditional optical cameras. In recent years, synthetic aperture imaging algorithms based on event cameras have been proposed to solve the see-through imaging problem in densely occluded scenes. An event camera senses pixel brightness changes in the logarithmic domain and asynchronously outputs event-stream data with high temporal resolution and high dynamic range; it can perceive the target almost continuously and therefore obtains sufficient target information in densely occluded scenes. However, because existing event-camera synthetic aperture imaging algorithms image from event points generated by the brightness difference between the target and the occluder, the number of effective event points drops in sparsely occluded scenes and performance degrades accordingly. Considering that the algorithms based on the traditional optical camera and on the event camera perform better in sparsely and densely occluded scenes respectively, the information from both can be fully exploited for fused imaging, achieving synthetic aperture imaging across occlusion densities. However, since the data modality of the event stream is completely different from that of image frame data, building a bridge between the two for fused imaging remains very difficult.
Disclosure of Invention
Aiming at these problems, the invention provides a synthetic aperture imaging method based on an event camera and a traditional optical camera, which combines the advantages of the two cameras for synthetic aperture imaging, builds a bridge between the event stream and the image frames by constructing a neural network architecture based on a spiking neural network and a convolutional neural network, and reconstructs a high-quality occlusion-free image of the target, accomplishing high-quality see-through imaging in occluded scenes of various densities.
The invention provides a synthetic aperture imaging method based on an event camera and a traditional optical camera, which comprises the following specific steps:
step 1, constructing a multi-view event stream and an image frame data set in an occlusion scene, and constructing the image frame data set in a non-occlusion scene.
Step 2, selecting a reference camera position, mapping a multi-view event stream and an image frame to the reference camera position according to a multi-view geometric principle, and refocusing an occluded target to obtain a refocused event stream and an image frame data set;
Step 3, constructing a hybrid neural network model: compress the refocused event stream into event frames; pre-reconstruct the unfocused event stream into pre-reconstructed event frames and refocus them; input the event frames, the pre-reconstructed event frames, and the refocused image frame data set into the hybrid neural network as the training set to obtain the network-predicted target reconstruction image; and, combining the occlusion-free image of the target with the network learning loss function, iteratively optimize the network parameters with the ADAM optimizer to obtain the trained hybrid neural network;
the hybrid neural network model described in step 3 includes the following modules: a multi-modal encoding module, a cross-attention enhancement module, a density-aware module, and a multi-modal decoding module;
the multi-modal encoding module comprises several convolutional layers or spiking layers and is used for feature extraction;
the cross-attention enhancement module comprises several cross-attention Transformer modules for repeatedly enhancing the extracted features;
the density-aware module concatenates the enhanced features with the original features, then applies a convolutional layer, global average pooling, global max pooling, and a fully connected layer, and finally outputs the fused features by weighted summation;
the multi-modal decoding module is an existing convolutional neural network architecture used for reconstructing the image from the fused features;
and step 4, inputting the event stream and image frames of the occluded target to be reconstructed into the trained network for prediction, obtaining the de-occluded reconstruction image corresponding to the occluded target.
Further, the multi-view event stream data set E in step 1 is:

E = {e_k = (x_k, y_k, p_k, t_k)}, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, −1}

where e_k is the k-th event point, x_k, y_k are its pixel coordinates, p_k is the event polarity (polarity 1 indicates a brightness increase and polarity −1 a brightness decrease), t_k is its timestamp, W and H are the width and height of the event coordinate space, and K is the number of event points.
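As an illustrative sketch (not part of the patent itself), such an event stream maps naturally onto a NumPy structured array; the 346×260 resolution matches the DAVIS346 sensor used later in the embodiment, while the random events here are purely synthetic:

```python
import numpy as np

# Event stream E = {e_k = (x_k, y_k, p_k, t_k)} as a structured array.
W, H, K = 346, 260, 1000
rng = np.random.default_rng(0)

events = np.zeros(K, dtype=[("x", "u2"), ("y", "u2"), ("p", "i1"), ("t", "f8")])
events["x"] = rng.integers(0, W, K)
events["y"] = rng.integers(0, H, K)
events["p"] = rng.choice([-1, 1], K)             # polarity: +1 brightness up, -1 down
events["t"] = np.sort(rng.uniform(0.0, 1.0, K))  # timestamps, monotonically ordered

on_events = events[events["p"] == 1]             # e.g. keep only "brightness up" events
```

This layout keeps each event point contiguous in memory and makes polarity or time-window filtering a one-line boolean mask.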
Further, the multi-view image frame data set F in step 1 is:

F = {I_n}, n ∈ [1, N], I_n ∈ R^(W×H)

where I_n denotes the n-th image frame, W and H denote the width and height of an image frame, and N denotes the total number of image frames.

Further, the mapping of the multi-view event stream and image frames to the reference camera position in step 2 is specifically:
[x_r, y_r, 1]^T = K R K^(−1) [x, y, 1]^T + K T / d  (1)

In formula (1), (x_r, y_r) are the mapped image coordinates, (x, y) are the original image coordinates, R and T are the rotation and translation matrices from the pixel's camera position to the reference camera position, K is the intrinsic matrix of the camera, and d is the refocusing distance, generally set to the distance from the occluded target to the camera plane. From this mapping formula, the refocused event stream data set E_r is obtained:

E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]

where e_k^r is the k-th refocused event point, x_k^r, y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (polarity 1 indicates a brightness increase and polarity −1 a brightness decrease), and t_k^r is its timestamp. Likewise, from mapping formula (1) the refocused image frame data set F_r is obtained:

F_r = {I_n^r}, n ∈ [1, N]

where I_n^r denotes the n-th refocused image frame.
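The refocusing map of formula (1) can be sketched as follows for the uniform-linear-motion case used later in the embodiment (R = I, T = [t_x, 0, 0]^T). The intrinsic parameters, timestamps, and refocus depth below are illustrative assumptions; only the 0.0885 m/s rail speed comes from the embodiment:

```python
import numpy as np

# [x_r, y_r, 1]^T = K R K^{-1} [x, y, 1]^T + K T / d, with no rotation (linear slide).
Kmat = np.array([[320.0, 0.0, 173.0],
                 [0.0, 320.0, 130.0],
                 [0.0,   0.0,   1.0]])    # assumed camera intrinsics
R = np.eye(3)                             # the camera does not rotate on the rail
v_track, t, t_r = 0.0885, 0.5, 0.3        # rail speed (m/s), current and reference times (s)
T = np.array([v_track * (t - t_r), 0.0, 0.0])
d = 1.2                                   # refocus depth: assumed target-to-camera distance (m)

def refocus(x, y):
    """Map pixel (x, y) of the current view to the reference camera position."""
    xr, yr, w = Kmat @ R @ np.linalg.inv(Kmat) @ np.array([x, y, 1.0]) + (Kmat @ T) / d
    return xr / w, yr / w

xr, yr = refocus(100.0, 50.0)
# With R = I the map reduces to a pure horizontal shift of f_x * T_x / d pixels.
```

The same map is applied to event coordinates and to every pixel of each image frame, which is why a single homography-plus-shift suffices per timestamp.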
Further, the refocused event stream framing (frame compression) procedure described in step 3 is expressed as:

I_j^(E,r)(x, y) = Σ_{t_k^r ∈ [t_1 + (j−1)Δt, t_1 + jΔt)} p_k^r · δ(x − x_k^r, y − y_k^r), j ∈ [1, J]  (2)

In formula (2), I_j^(E,r) is the j-th compressed event frame, J is the total number of event frames, (x, y) are image coordinates, (x_k^r, y_k^r) are the image coordinates of the k-th refocused event point, δ(·) is the Dirac delta function, and Δt is the time span covered by a single event frame, computed as:

Δt = (t_K − t_1) / J

where t_K is the timestamp of the last (K-th) event point and t_1 is the timestamp of the first. This yields the compressed refocused event frame data set:

F_(E,r) = {I_j^(E,r)}, j ∈ [1, J]
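The frame-compression step of formula (2) amounts to binning each event's polarity into the frame covering its timestamp. A minimal sketch (event data, sensor size, and J = 4 are illustrative assumptions):

```python
import numpy as np

def events_to_frames(x, y, p, t, W, H, J):
    """Accumulate event polarities into J event frames of equal duration dt = (t_K - t_1)/J."""
    frames = np.zeros((J, H, W))
    dt = (t[-1] - t[0]) / J
    j = np.minimum(((t - t[0]) / dt).astype(int), J - 1)  # frame index per event
    np.add.at(frames, (j, y, x), p)   # I_j(x, y) += p_k  (the discrete Dirac delta)
    return frames

W, H, J = 346, 260, 4
rng = np.random.default_rng(1)
x = rng.integers(0, W, 500); y = rng.integers(0, H, 500)
p = rng.choice([-1, 1], 500); t = np.sort(rng.uniform(0.0, 1.0, 500))
F_E = events_to_frames(x, y, p, t, W, H, J)
```

`np.add.at` is used instead of `frames[j, y, x] += p` because it accumulates correctly when several events land on the same pixel of the same frame.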
Further, the event stream pre-reconstruction process described in step 3 is expressed as:

I_j^(E→F) = Recon({e_k : t_k ∈ [t_1 + (j−1)Δt, t_1 + jΔt)}), j ∈ [1, J]  (3)

In formula (3), I_j^(E→F) is the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, Recon(·) is an event-stream brightness reconstruction operator, for which a currently mainstream event-stream reconstruction algorithm is generally used, and Δt is the time span covered by a single event frame. Next, using mapping formula (1), each image I_j^(E→F) is mapped to the reference camera position, giving the refocused pre-reconstructed event frame data set:

F_(E→F,r) = {I_j^(E→F,r)}, j ∈ [1, J]

where I_j^(E→F,r) is the j-th refocused pre-reconstructed event frame.
Further, the hybrid neural network model in step 3 includes the following modules: a multi-modal encoding module, a cross-attention enhancement module, a density-aware module, and a multi-modal decoding module. The network inputs, namely the compressed refocused event frames F_(E,r), the refocused image frames F_r, and the refocused pre-reconstructed event frames F_(E→F,r), are first fed through three branches into the multi-modal encoding module for coarse feature extraction:

f_(F,0), f_(E,0), f_(E→F,0) = MF-Encoder(F_r, F_(E,r), F_(E→F,r))

where f_(F,0), f_(E,0), f_(E→F,0) are the coarse features extracted from F_r, F_(E,r), F_(E→F,r), and MF-Encoder(·) is the multi-modal encoding operation, which extracts features from the three signals separately. For the two signals F_r and F_(E→F,r), feature extraction uses a three-layer convolutional architecture with skip connections. For the F_(E,r) signal, feature extraction uses three spiking layers with skip connections.
Next, the features are enhanced with the cross-attention enhancement module:

f_(F,M), f_(E,M), f_(E→F,M) = CME(f_(F,0), f_(E,0), f_(E→F,0))

where f_(F,0), f_(E,0), f_(E→F,0) are the three-branch features output by the multi-modal encoding module, f_(F,M), f_(E,M), f_(E→F,M) are the three-branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which enhances the features repeatedly through M cross-attention Transformer modules:

f_(F,m), f_(E,m), f_(E→F,m) = C-Transformer(f_(F,m−1), f_(E,m−1), f_(E→F,m−1)), m ∈ [1, M]

where f_(F,m−1), f_(E,m−1), f_(E→F,m−1) are the three-branch features input to the m-th cross-attention Transformer module and f_(F,m), f_(E,m), f_(E→F,m) are the output features. C-Transformer(·) denotes the cross-attention Transformer feature enhancement, which first normalizes each branch and computes its attention information, then adds and fuses the attention information of the three branches to achieve cross-modal information enhancement, and finally enhances the features with a multi-layer perceptron, outputting the enhanced features f_(F,m), f_(E,m), f_(E→F,m).
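The attention-summing fusion described above can be sketched in a deliberately simplified form: a single attention head with shared, randomly initialized weights, where each branch's self-attention output is summed across branches before a per-branch MLP. Token count, channel width, the single head, and all weights are illustrative assumptions, not the patent's architecture:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(f, eps=1e-5):
    return (f - f.mean(-1, keepdims=True)) / (f.std(-1, keepdims=True) + eps)

def attention(f, Wq, Wk, Wv):
    q, k, v = f @ Wq, f @ Wk, f @ Wv
    return softmax(q @ k.T / np.sqrt(k.shape[-1])) @ v

rng = np.random.default_rng(2)
N, C = 16, 32                                  # 16 tokens, 32 channels per branch
Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
W1, W2 = rng.standard_normal((C, C)) * 0.1, rng.standard_normal((C, C)) * 0.1

branches = [rng.standard_normal((N, C)) for _ in range(3)]    # f_F, f_E, f_E->F
attn_sum = sum(attention(layer_norm(f), Wq, Wk, Wv) for f in branches)
enhanced = [f + attn_sum for f in branches]                   # shared cross-modal term
enhanced = [f + np.maximum(f @ W1, 0) @ W2 for f in enhanced] # per-branch MLP (ReLU)
```

Summing the three attention outputs is what injects each modality's information into the other two branches; the residual connections keep each branch's own signal intact.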
Next, the density-aware module performs weighted fusion of the three-branch features:

f_ALL = DAF(f_(F,M), f_(E,M), f_(E→F,M), F_r, F_(E,r), F_(E→F,r))

where f_(F,M), f_(E,M), f_(E→F,M) are the three enhanced features from the previous step, F_r, F_(E,r), F_(E→F,r) are the three original network inputs, f_ALL is the fused output feature, and DAF(·) is the density-aware fusion operation. It concatenates each input feature with its corresponding original input along the feature dimension and feeds the result into a single convolutional layer for feature extraction; global average pooling and global max pooling then extract the mean and maximum of each of the three features, which are fed into a fully connected layer to compute a weight; finally, the three signals are weighted and summed to output the fused feature f_ALL.
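The pooling-and-weighting core of the fusion can be sketched as follows; the convolution/concatenation front end is omitted, and the fully connected layer is a single random vector here, so everything except the GAP/GMP-then-weight structure is an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(3)
C, H, W = 32, 16, 16
feats = [rng.standard_normal((C, H, W)) for _ in range(3)]    # f_F, f_E, f_E->F stand-ins

W_fc = rng.standard_normal(2 * C) * 0.1                       # assumed FC weights
# GAP ++ GMP descriptor per branch -> scalar score via the FC layer.
scores = np.array([W_fc @ np.concatenate([f.mean((1, 2)), f.max((1, 2))])
                   for f in feats])
weights = np.exp(scores - scores.max())
weights /= weights.sum()                                      # normalized branch weights
f_all = sum(w * f for w, f in zip(weights, feats))            # density-weighted fusion
```

The learned weights let the network lean on the image-frame branch in sparse occlusion and on the event branches in dense occlusion, which is the point of making the fusion density-aware.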
Finally, the multi-modal decoding module decodes the features and outputs the final brightness reconstruction image:

I_recon = MF-Decoder(f_ALL)

where f_ALL is the fused feature output by the previous module, I_recon is the reconstructed brightness image, and MF-Decoder(·) is the multi-modal feature decoding operation, which generally uses a mainstream convolutional neural network architecture.
Further, the learning loss function L in step 3 is defined as:

L(I_recon, I_gt) = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)  (4)

where I_recon in formula (4) is the brightness image reconstructed by the network, I_gt is the corresponding ground-truth target image in the data set, L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss; these loss functions are all commonly used known functions, and β_per, β_L1, β_tv are the weights of the corresponding losses.
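The L1 and total-variation terms of formula (4) are simple enough to sketch directly; the perceptual term requires a pretrained feature network (such as VGG) and is omitted here, and the random images are purely illustrative:

```python
import numpy as np

def l1_loss(recon, gt):
    """Mean absolute difference between reconstruction and ground truth."""
    return np.abs(recon - gt).mean()

def tv_loss(img):
    """Total variation: mean absolute difference between neighboring pixels."""
    dx = np.abs(np.diff(img, axis=1)).mean()
    dy = np.abs(np.diff(img, axis=0)).mean()
    return dx + dy

rng = np.random.default_rng(4)
I_recon, I_gt = rng.random((64, 64)), rng.random((64, 64))
beta_l1, beta_tv = 32.0, 0.0002        # weights from the embodiment (perceptual term omitted)
loss = beta_l1 * l1_loss(I_recon, I_gt) + beta_tv * tv_loss(I_recon)
```

The tiny TV weight acts only as a smoothness regularizer on the reconstruction, while the L1 term dominates the fit to the ground-truth image.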
Furthermore, the input event stream and image frame data in step 4 must first undergo the same refocusing process as in step 2 and the same event stream preprocessing as in step 3, and are then fed into the trained neural network to obtain the corresponding target reconstruction image.
Compared with the prior art, the invention has the advantages that:
The invention provides a synthetic aperture imaging method that fuses an event camera with a traditional optical camera. It comprehensively exploits the advantages of both cameras together with the strong learning ability of neural networks, thereby achieving high-quality image reconstruction of targets in occlusion scenes of various densities and enhancing the robustness and applicability of synthetic aperture imaging technology.
Drawings
FIG. 1 is a schematic diagram of an experimental scenario including an occluded object, an occlusion object, a programmable sled, and a camera sensor.
FIG. 2 is a flow chart of the overall process of the present invention.
Fig. 3 is a schematic diagram of a camera moving data acquisition process.
Fig. 4 is a contrast diagram of an image frame, an event frame and a pre-reconstruction event frame after data preprocessing.
Fig. 5 is a schematic diagram of a hybrid neural network structure.
FIG. 6 compares the present method with different synthetic aperture imaging methods. Each row, from top to bottom, shows results for a densely occluded scene. From left to right: the first column is the occluder image, the second column is the occlusion-free reference image, the third column is a synthetic aperture imaging algorithm based on a traditional optical camera (F-SAI+ACC), the fourth column is a synthetic aperture imaging algorithm based on a traditional optical camera and a convolutional neural network (F-SAI+CNN), the fifth column is a synthetic aperture imaging algorithm based on an event camera and an accumulation method (E-SAI+ACC), the sixth column is a synthetic aperture imaging algorithm based on an event camera and a hybrid neural network (E-SAI+Hybrid), and the seventh column is the algorithm of the invention.
Detailed Description
For a clearer understanding of the present invention, the technical content of the invention is described more clearly and completely below through an example in conjunction with FIG. 1. The described example is obviously only one of the embodiments of the invention, not all of them. All other examples obtained by a person of ordinary skill in the art based on the examples in this invention, without inventive effort, fall within the scope of the invention.
The specific embodiment of the invention is a synthetic aperture imaging method that fuses an event camera with a traditional optical camera.
A schematic scene of a specific implementation of the invention is shown in FIG. 1 of the accompanying drawings, and comprises: a camera sensor, a programmable slide rail, an occluder, and an occluded target;
The camera sensor is a DAVIS346 event camera, which synchronously outputs event stream and image frame data and is used to construct the corresponding data sets;
the DAVIS346 event camera is fixed on a programmable slide rail set to move linearly at a constant speed;
the whole flow chart of the invention is shown in figure 2 of the accompanying drawings, and the specific steps are as follows:
step 1, constructing a multi-view event stream and an image frame data set in an occlusion scene, and constructing an image frame data set in a non-occlusion scene, as shown in fig. 3;
The multi-view event stream data E described in step 1 is:

E = {e_k = (x_k, y_k, p_k, t_k)}, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, −1}

where e_k is the k-th event point, x_k, y_k are its pixel coordinates, p_k is the event polarity (polarity 1 indicates a brightness increase and polarity −1 a brightness decrease), t_k is its timestamp, W = 346 and H = 260 are the width and height of the event coordinate space, and K is the number of event points.
The multi-view image frame data F in step 1 is:

F = {I_n}, n ∈ [1, N], I_n ∈ R^(W×H)

where I_n is the n-th image frame, W = 346 and H = 260 are the width and height of the image frames, and N = 30 is the total number of image frames.
Step 2, selecting a reference camera position, mapping a multi-view event stream and an image frame to the reference camera position according to a multi-view geometric principle, and refocusing an occluded target to obtain a refocused event stream and an image frame data set;
The mapping of the multi-view event stream and image frames to the reference camera position in step 2 is specifically:

[x_r, y_r, 1]^T = K R K^(−1) [x, y, 1]^T + K T / d  (1)

In formula (1), (x_r, y_r) are the mapped image coordinates, R and T are the rotation and translation matrices from the pixel's camera position to the reference camera position, K is the intrinsic matrix of the camera, and d is the focusing distance, generally set to the distance from the occluded target to the camera plane. In this example, since the camera is set to move linearly at a constant speed, the camera can be considered not to rotate during shooting, so the rotation matrix is the identity matrix. At time t, the translation matrix can be modeled as

T_t = [v_track(t − t_r), 0, 0]^T

where v_track = 0.0885 m/s is the movement speed of the rail and t_r is the timestamp at which the camera passes the reference position in each capture.
From mapping formula (1), the refocused event stream data E_r is obtained:

E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]

where e_k^r is the k-th refocused event point, x_k^r, y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (polarity 1 indicates a brightness increase and polarity −1 a brightness decrease), and t_k^r is its timestamp. Likewise, from mapping formula (1) the refocused image frame data set F_r is obtained:

F_r = {I_n^r}, n ∈ [1, N]

where I_n^r is the n-th refocused image frame.
Step 3, constructing a hybrid neural network model: compress the refocused event stream into event frames; pre-reconstruct the unfocused event stream into pre-reconstructed event frames and refocus them; input the event frames, the pre-reconstructed event frames, and the refocused image frame data set into the hybrid neural network as the training set to obtain the network-predicted target reconstruction image; and, combining the occlusion-free image of the target with the network learning loss function, iteratively optimize the network parameters with the ADAM optimizer to obtain the trained hybrid neural network;
The refocused event stream framing (frame compression) procedure described in step 3 is expressed as:

I_j^(E,r)(x, y) = Σ_{t_k^r ∈ [t_1 + (j−1)Δt, t_1 + jΔt)} p_k^r · δ(x − x_k^r, y − y_k^r), j ∈ [1, J]  (2)

In formula (2), I_j^(E,r) is the j-th compressed event frame, J is the total number of event frames, (x, y) are image coordinates, (x_k^r, y_k^r) are the image coordinates of the k-th refocused event point, δ(·) is the Dirac delta function, and Δt is the time span covered by a single event frame, computed as:

Δt = (t_K − t_1) / J

where t_K is the timestamp of the last (K-th) event point and t_1 is the timestamp of the first. This yields the compressed refocused event frame data set:

F_(E,r) = {I_j^(E,r)}, j ∈ [1, J]
The process of pre-reconstructing and refocusing the event stream described in step 3 is expressed as:

I_j^(E→F) = Recon({e_k : t_k ∈ [t_1 + (j−1)Δt, t_1 + jΔt)}), j ∈ [1, J]  (3)

In formula (3), I_j^(E→F) is the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, Recon(·) is the event-stream brightness reconstruction operator, for which this example uses the existing E2VID algorithm, and Δt is the time span covered by a single event frame. Next, using mapping formula (1), each image I_j^(E→F) is mapped to the reference camera position, giving the refocused pre-reconstructed event frame data set:

F_(E→F,r) = {I_j^(E→F,r)}, j ∈ [1, J]

where I_j^(E→F,r) is the j-th refocused pre-reconstructed event frame.
The refocused image frames, event frames, and pre-reconstructed event frames are each visually illustrated in fig. 4 of the drawings.
The structure of the hybrid neural network in step 3 is shown in FIG. 5 of the accompanying drawings. The hybrid neural network model described in step 3 includes the following modules: a multi-modal encoding module, a cross-attention enhancement module, a density-aware module, and a multi-modal decoding module. The network inputs, namely the compressed refocused event frames F_(E,r), the refocused image frames F_r, and the refocused pre-reconstructed event frames F_(E→F,r), are fed through three branches into the multi-modal encoding module for coarse feature extraction:

f_(F,0), f_(E,0), f_(E→F,0) = MF-Encoder(F_r, F_(E,r), F_(E→F,r))

where f_(F,0), f_(E,0), f_(E→F,0) are the coarse features extracted from F_r, F_(E,r), F_(E→F,r), and MF-Encoder(·) is the multi-modal encoding operation, which extracts features from the three signals separately. For the two signals F_r and F_(E→F,r), feature extraction uses a three-layer convolutional architecture with skip connections; in this embodiment the kernel sizes of the three convolutional layers are 3, 5, and 7, with 8, 16, and 32 kernels respectively. For the F_(E,r) signal, feature extraction uses three spiking layers with skip connections; in this embodiment the kernel sizes of the three spiking layers are 1, 3, and 7, with 8, 16, and 32 kernels respectively.
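The spiking branch above processes event frames with spiking layers. A minimal leaky integrate-and-fire (LIF) neuron sketch illustrates the basic mechanism; the decay factor, threshold, and random inputs are illustrative assumptions, and real spiking layers additionally learn convolutional weights:

```python
import numpy as np

def lif(inputs, decay=0.8, v_th=1.0):
    """inputs: (T, N) input current per time step; returns (T, N) binary spikes."""
    v = np.zeros(inputs.shape[1])
    spikes = np.zeros_like(inputs)
    for step, x in enumerate(inputs):
        v = decay * v + x                        # leaky integration of membrane potential
        spikes[step] = (v >= v_th)               # fire when the threshold is crossed
        v = np.where(spikes[step] > 0, 0.0, v)   # hard reset after a spike
    return spikes

rng = np.random.default_rng(5)
s = lif(rng.random((10, 4)))                     # 10 time steps, 4 neurons
```

The stateful integration over time steps is what lets the spiking branch exploit the fine temporal structure of the event frames, unlike the stateless convolutional branches.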
Next, the features are enhanced with the cross-attention enhancement module:

f_(F,M), f_(E,M), f_(E→F,M) = CME(f_(F,0), f_(E,0), f_(E→F,0))

where f_(F,0), f_(E,0), f_(E→F,0) are the three-branch features output by the multi-modal encoding module, f_(F,M), f_(E,M), f_(E→F,M) are the three-branch features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which enhances the features repeatedly through M = 6 cross-attention Transformer modules:

f_(F,m), f_(E,m), f_(E→F,m) = C-Transformer(f_(F,m−1), f_(E,m−1), f_(E→F,m−1)), m ∈ [1, M]

where f_(F,m−1), f_(E,m−1), f_(E→F,m−1) are the three-branch features input to the m-th cross-attention Transformer module and f_(F,m), f_(E,m), f_(E→F,m) are the output features. C-Transformer(·) denotes the cross-attention Transformer feature enhancement, which first normalizes each branch and computes its attention information, then adds and fuses the attention information of the three branches to achieve cross-modal information enhancement, and finally enhances the features with a multi-layer perceptron, outputting the enhanced features f_(F,m), f_(E,m), f_(E→F,m).
Next, the density-aware module performs weighted fusion of the three-branch features:

f_ALL = DAF(f_(F,M), f_(E,M), f_(E→F,M), F_r, F_(E,r), F_(E→F,r))

where f_(F,M), f_(E,M), f_(E→F,M) are the three enhanced features from the previous step, F_r, F_(E,r), F_(E→F,r) are the three original network inputs, f_ALL is the fused output feature, and DAF(·) is the density-aware fusion operation. It concatenates each input feature with its corresponding original input along the feature dimension and feeds the result into a single convolutional layer for feature extraction; in this embodiment the kernel size is 3 and the number of kernels is 32. Global average pooling and global max pooling then extract the mean and maximum of each of the three features, which are fed into a fully connected layer to compute a weight; finally, the three signals are weighted and summed to output the fused feature f_ALL.
Finally, decoding the features by using a multi-mode decoding module, and outputting a final brightness reconstruction image:
I_{recon} = MF-Decoder(f_{ALL})
Wherein f_{ALL} is the fused feature output by the previous module, I_{recon} is the reconstructed luminance image, and MF-Decoder(·) is the multi-mode feature decoding operation; in this embodiment, the existing ResNet architecture is used as the multi-mode feature decoder.
The learning loss function of the network in the step 3 is defined as:
L = β_per·L_per(I_recon, I_gt) + β_L1·L_L1(I_recon, I_gt) + β_tv·L_tv(I_recon) (4)

In formula (4), I_recon is the luminance image reconstructed by the network, I_gt is the ground-truth image corresponding to the target image in the dataset, L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss; these are all commonly used, well-known loss functions. β_per, β_L1, β_tv denote the weights of the corresponding losses; in this example, β_per = 1, β_L1 = 32, β_tv = 0.0002.
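The L1 and total variation terms of the loss above are straightforward to compute directly; the sketch below combines them with the stated weights. The perceptual term normally requires a pretrained network (e.g. VGG features), so it is left as an optional callable rather than implemented here; this is a hedged NumPy illustration, not the patent's training code.

```python
import numpy as np

def l1_loss(pred, gt):
    # Mean absolute error between reconstruction and ground truth.
    return np.abs(pred - gt).mean()

def tv_loss(img):
    # Anisotropic total variation: mean absolute difference of neighbours.
    dx = np.abs(img[:, 1:] - img[:, :-1]).mean()
    dy = np.abs(img[1:, :] - img[:-1, :]).mean()
    return dx + dy

def total_loss(pred, gt, beta_per=1.0, beta_l1=32.0, beta_tv=2e-4,
               perceptual=None):
    """Weighted sum of the three terms in formula (4). `perceptual`
    is an optional callable (pred, gt) -> float; a zero stand-in is
    used when it is absent."""
    per = perceptual(pred, gt) if perceptual else 0.0
    return beta_per * per + beta_l1 * l1_loss(pred, gt) + beta_tv * tv_loss(pred)
```

With a perfect reconstruction the L1 and TV-of-constant terms vanish, so `total_loss(gt, gt)` returns 0.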
Step 4: input the event stream and image frames of the occluded target to be reconstructed into the trained network for prediction, obtaining the de-occluded reconstructed image corresponding to the occluded target.
The input event stream and image frame data in step 4 must first undergo the same refocusing process as in step 2 and the same event stream preprocessing as in step 3; they are then fed into the trained neural network to obtain the corresponding target reconstruction image.
Figure 6 compares the results of the algorithm of the present invention with other synthetic aperture imaging algorithms in a variety of occlusion scenes. From top to bottom are synthetic aperture imaging results for densely occluded scenes with different targets. From left to right, the first column is the occluded view, the second column is the occlusion-free reference image, the third column is a synthetic aperture imaging algorithm based on a conventional optical camera (F-SAI+ACC), the fourth column is a synthetic aperture imaging algorithm based on a conventional optical camera and a convolutional neural network (F-SAI+CNN), the fifth column is a synthetic aperture imaging algorithm based on an event camera and an accumulation method (E-SAI+ACC), the sixth column is a synthetic aperture imaging algorithm based on an event camera and a hybrid neural network (E-SAI+Hybrid), and the seventh column is the algorithm of the present invention.
It should be understood that technical portions that are not specifically set forth in the present specification are all prior art.
It should be understood that the foregoing description of the preferred embodiments is not to be construed as limiting the scope of the invention, but that substitutions and modifications can be made by one of ordinary skill in the art without departing from the scope of the invention as defined by the appended claims.
Claims (9)
1. A synthetic aperture imaging method for fusing an event camera with a conventional optical camera, comprising the steps of:
Step 1, constructing a multi-view event stream and an image frame data set in an occlusion scene, and constructing an image frame data set in a non-occlusion scene;
Step 2, selecting a reference camera position, mapping a multi-view event stream and an image frame to the reference camera position according to a multi-view geometric principle, and refocusing an occluded target to obtain a refocused event stream and an image frame data set;
Step 3, constructing a hybrid neural network model, namely compressing a refocused event stream to obtain an event frame, performing pre-reconstruction processing on an unfocused event stream to obtain a pre-reconstructed event frame, refocusing, inputting the pre-reconstructed event frame and the refocused image frame data set into the hybrid neural network model as a training set to obtain a target reconstructed image after network prediction, and iteratively optimizing network parameters through an ADAM optimizer by combining a non-occlusion image of the target and a network learning loss function to obtain a trained hybrid neural network model;
the hybrid neural network model described in step 3 includes the following modules: a multi-mode encoding module, a cross-attention enhancement module, a density sensing module and a multi-mode decoding module;
the multi-mode coding module comprises a plurality of convolution layers or pulse layers and is used for extracting features;
The cross-attention enhancement module comprises a plurality of cross-attention Transformer modules for enhancing the extracted features multiple times;
the density sensing module comprises a feature cascade of enhanced features and original features, then a convolution layer, global average pooling, global maximum pooling and a full connection layer are adopted, and finally weighted summation is carried out to output the fused features;
the multi-mode decoding module is an existing convolutional neural network architecture and is used for reconstructing images according to the fused characteristics;
Step 4, inputting the shielded target event stream to be reconstructed and the image frame into a trained hybrid neural network model for prediction to obtain a de-shielded reconstructed image corresponding to the shielded target;
the refocused event frame F_{E,r} after frame compression, the refocused image frame F_r, and the pre-reconstructed event frame F_{E→F,r} of the refocused event stream, which are input to the hybrid neural network model, are fed into the multi-mode encoding module as three branches for coarse feature extraction:
f_{F,0}, f_{E,0}, f_{E→F,0} = MF-Encoder(F_r, F_{E,r}, F_{E→F,r})
Wherein f_{F,0}, f_{E,0}, f_{E→F,0} denote the coarse features extracted from F_r, F_{E,r}, F_{E→F,r}, and MF-Encoder(·) denotes the multi-mode encoding operation, which performs feature extraction on the three signal paths: for the two signal paths F_r and F_{E→F,r}, a three-layer convolution architecture with skip connections is used to extract features; for the F_{E,r} signal, three pulse layers with skip connections are used for feature extraction;
next, the features are enhanced using a cross-attention enhancement module:
f_{F,M}, f_{E,M}, f_{E→F,M} = CME(f_{F,0}, f_{E,0}, f_{E→F,0})
Wherein f_{F,0}, f_{E,0}, f_{E→F,0} are the three-way features output by the multi-mode encoding module, f_{F,M}, f_{E,M}, f_{E→F,M} are the three-way features after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which enhances the features multiple times through M cross-attention Transformer modules; the process is expressed as:
f_{F,m}, f_{E,m}, f_{E→F,m} = C-Transformer(f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1}), m ∈ [1, M]
Wherein f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1} are the three-way features input to the m-th cross-attention Transformer module, f_{F,m}, f_{E,m}, f_{E→F,m} are the three-way features output, and C-Transformer(·) denotes the feature enhancement operation of the cross-attention Transformer: each signal path is first normalized and its attention information computed; the attention information of the three signal paths is added and fused to achieve cross-modal information enhancement; a multi-layer perceptron then performs feature enhancement and outputs the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m};
next, the density sensing module is used for carrying out weighted fusion on the three paths of characteristics:
f_{ALL} = DAF(f_{F,M}, f_{E,M}, f_{E→F,M}, F_r, F_{E,r}, F_{E→F,r})
Wherein f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced features output by the cross-attention enhancement module, F_r, F_{E,r}, F_{E→F,r} are the three original input signals, f_{ALL} is the fused output feature, and DAF(·) is the density-aware fusion operation: each input feature is concatenated with its corresponding original input along the feature dimension and fed into a single convolution layer for feature extraction; global average pooling and global max pooling then extract the mean and maximum of each of the three feature paths, which are fed into a fully connected layer to compute a weight; finally, the three signal paths are weighted and summed, outputting the fused feature f_{ALL};
Finally, decoding the features by using a multi-mode decoding module, and outputting a final brightness reconstruction image:
I_{recon} = MF-Decoder(f_{ALL})
Wherein f_{ALL} is the fused feature output by the density sensing module, I_{recon} is the reconstructed luminance image, and MF-Decoder(·) is the multi-mode feature decoding operation.
2. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 1, wherein: the multi-view event stream dataset E described in step 1 is represented as:
E = {e_k = (x_k, y_k, p_k, t_k)}, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, −1}
Wherein e_k is the k-th event point, x_k, y_k are the pixel coordinates of the k-th event point, p_k is the event polarity (a polarity of 1 represents an increase in light intensity, a polarity of −1 a decrease), t_k is the timestamp of the event point, W and H respectively denote the width and height of the event-point coordinate space, and K denotes the number of event points.
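For illustration, an event stream of this form maps naturally onto a NumPy structured array, which makes polarity filtering and time slicing one-liners. The field names and sample values below are illustrative, not from the patent:

```python
import numpy as np

# Structured dtype mirroring E = {e_k = (x_k, y_k, p_k, t_k)}.
event_dtype = np.dtype([("x", np.int32), ("y", np.int32),
                        ("p", np.int8), ("t", np.float64)])

# Three sample events (coordinates, polarity, timestamp in seconds).
events = np.array([(10, 20, 1, 0.001),
                   (11, 20, -1, 0.002),
                   (10, 21, 1, 0.004)], dtype=event_dtype)

on_events = events[events["p"] == 1]    # light intensity increased
off_events = events[events["p"] == -1]  # light intensity decreased
```

The same layout works for the refocused stream E_r: refocusing only rewrites the `x`/`y` fields while `p` and `t` are carried through unchanged.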
3. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 1, wherein: the multi-view image frame dataset F is represented as:
F = {I_n}, n ∈ [1, N], I_n ∈ R^{W×H}
Wherein I_n denotes the n-th image frame, W and H denote the width and height of an image frame, respectively, and N denotes the total number of image frames.
4. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 1, wherein: the method for mapping the multi-view event stream and image frames to the reference camera position in step 2 is specifically:
[x_r, y_r, 1]^T = K·R·K^{-1}·[x, y, 1]^T + K·T/d (1)
In formula (1), x_r, y_r denote the mapped image coordinates, R and T denote the rotation and translation matrices from the pixel's camera position to the reference camera position, x, y denote the original image coordinates, K denotes the intrinsic matrix of the camera, and d is the focusing distance;
From mapping formula (1), the refocused event stream dataset E_r can be obtained:
E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]
wherein e_k^r is the k-th event point after refocusing, x_k^r, y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (a polarity of 1 indicates an increase in light intensity, a polarity of −1 a decrease), and t_k^r is the timestamp of the event point; likewise, according to mapping formula (1), the refocused image frame dataset F_r can be obtained:
F_r = {I_n^r}, n ∈ [1, N]
wherein I_n^r denotes the n-th refocused image frame.
5. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 4, wherein: the refocused event stream framing procedure described in step 3 is expressed as:
F_{E,r}^j(x, y) = Σ_{t_k^r ∈ [t_1 + (j−1)·Δt, t_1 + j·Δt]} p_k^r·δ(x − x_k^r, y − y_k^r), j ∈ [1, J] (2)

In formula (2), F_{E,r}^j denotes the j-th event frame after frame compression, J is the total number of event frames, x and y denote image coordinates, x_k^r, y_k^r are the image coordinates of the k-th refocused event point, δ(·) denotes the Dirac delta function, and Δt denotes the time span used for a single compressed event frame, calculated as:
Δt = (t_K − t_1)/J
wherein t_K is the timestamp of the last (K-th) event point and t_1 is the timestamp of the first event point; thus, the refocused event frame dataset after frame compression can be obtained:
F_{E,r} = {F_{E,r}^j}, j ∈ [1, J]
6. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 5, wherein: the event stream pre-reconstruction and refocusing process in step 3 is expressed as:
F_{E→F}^j(x, y) = Recon(Σ_{t_k ∈ [t_1 + (j−1)·Δt, t_1 + j·Δt]} p_k·δ(x − x_k, y − y_k)), j ∈ [1, J] (3)

In formula (3), F_{E→F}^j denotes the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, x and y denote image coordinates, x_k, y_k are the image coordinates of the k-th event point, δ(·) is the Dirac delta function, Recon(·) denotes the event stream luminance reconstruction operator, and Δt denotes the time span used for a single compressed event frame; next, using mapping formula (1) described above, each image F_{E→F}^j is mapped to the reference camera position, obtaining the refocused pre-reconstructed event frame dataset F_{E→F,r}:
F_{E→F,r} = {F_{E→F,r}^j}, j ∈ [1, J]
wherein F_{E→F,r}^j is the j-th refocused pre-reconstructed event frame.
7. A synthetic aperture imaging method of a fused event camera and conventional optical camera as defined in claim 1, wherein: the learning loss function of the network in the step 3 is defined as:
L = β_per·L_per(I_recon, I_gt) + β_L1·L_L1(I_recon, I_gt) + β_tv·L_tv(I_recon) (4)

In formula (4), I_recon is the luminance image reconstructed by the hybrid neural network model, I_gt is the ground-truth image corresponding to the target image in the dataset, L_per is the perceptual loss, L_L1 is the L1-norm loss, L_tv is the total variation loss, and β_per, β_L1, β_tv denote the weights of the corresponding losses.
8. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 1, wherein: the convolution kernel sizes of the three convolution layers are 3, 5 and 7, and the numbers of convolution kernels are 8, 16 and 32, respectively; the kernel sizes of the three pulse layers are 1, 3 and 7, and the numbers of pulse kernels are 8, 16 and 32, respectively.
9. A synthetic aperture imaging method fusing an event camera and a conventional optical camera as defined in claim 1, wherein: the Transformer module is an existing Swin-Transformer architecture, and the multi-mode feature decoder is an existing ResNet architecture.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210422694.2A CN114862732B (en) | 2022-04-21 | 2022-04-21 | Synthetic aperture imaging method integrating event camera and traditional optical camera |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114862732A CN114862732A (en) | 2022-08-05 |
CN114862732B true CN114862732B (en) | 2024-04-26 |
Family
ID=82630677
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115578295B (en) * | 2022-11-17 | 2023-04-07 | 中国科学技术大学 | Video rain removing method, system, equipment and storage medium |
CN116310408B (en) * | 2022-11-29 | 2023-10-13 | 北京大学 | Method and device for establishing data association between event camera and frame camera |
CN115761472B (en) * | 2023-01-09 | 2023-05-23 | 吉林大学 | Underwater dim light scene reconstruction method based on fusion event and RGB data |
CN117939309A (en) * | 2024-03-25 | 2024-04-26 | 荣耀终端有限公司 | Image demosaicing method, electronic device and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111667442A (en) * | 2020-05-21 | 2020-09-15 | 武汉大学 | High-quality high-frame-rate image reconstruction method based on event camera |
CN112987026A (en) * | 2021-03-05 | 2021-06-18 | 武汉大学 | Event field synthetic aperture imaging algorithm based on hybrid neural network |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11037278B2 (en) * | 2019-01-23 | 2021-06-15 | Inception Institute of Artificial Intelligence, Ltd. | Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures |
US11288818B2 (en) * | 2019-02-19 | 2022-03-29 | The Trustees Of The University Of Pennsylvania | Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning |
Non-Patent Citations (2)
Title |
---|
Event-based Synthetic Aperture Imaging with a Hybrid Network; Zhang Xiang et al.; IEEE; 2021-11-02; pp. 14235-14242 *
Synthetic aperture imaging method based on confocal illumination; Xiang Yiyi et al.; Acta Optica Sinica; 2020-04-30; Vol. 40, No. 08; pp. 73-79 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||