CN114862732B - Synthetic aperture imaging method integrating event camera and traditional optical camera - Google Patents

Synthetic aperture imaging method integrating event camera and traditional optical camera

Info

Publication number
CN114862732B
CN114862732B CN202210422694.2A CN202210422694A
Authority
CN
China
Prior art keywords
event
image
camera
synthetic aperture
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210422694.2A
Other languages
Chinese (zh)
Other versions
CN114862732A (en)
Inventor
余磊
廖伟
张翔
王阳光
杨文�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan University WHU
Original Assignee
Wuhan University WHU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan University WHU filed Critical Wuhan University WHU
Priority to CN202210422694.2A priority Critical patent/CN114862732B/en
Publication of CN114862732A publication Critical patent/CN114862732A/en
Application granted granted Critical
Publication of CN114862732B publication Critical patent/CN114862732B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a synthetic aperture imaging method that fuses an event camera with a traditional optical camera, combining the advantages of the two cameras for synthetic aperture imaging. By constructing a neural network architecture based on a spiking neural network and a convolutional neural network, the method builds a bridge between the event stream and the image frames and reconstructs high-quality, occlusion-free target image frames, completing high-quality see-through imaging tasks in occluded scenes of various densities. The invention comprehensively exploits the advantages of the event camera and the traditional optical camera together with the strong learning ability of neural networks, thereby achieving high-quality image reconstruction of targets in occluded scenes of various densities and further enhancing the robustness and applicability of synthetic aperture imaging.

Description

Synthetic aperture imaging method integrating event camera and traditional optical camera
Technical Field
The invention belongs to the field of image processing, and particularly relates to a synthetic aperture imaging method for fusing an event camera with a traditional optical camera.
Background
Synthetic aperture imaging (SAI) techniques use a camera to observe a scene from multiple viewpoints, which is equivalent to imaging with a single virtual camera of very large aperture. Since a larger aperture implies a shallower depth of field, synthetic aperture imaging can blur out an occluder and image the occluded target behind it, which gives it extremely high application value in fields such as three-dimensional reconstruction, target tracking and recognition.
Current synthetic aperture imaging algorithms mainly perform see-through imaging with a sequence of multi-view image frames captured by a traditional optical camera. However, as the occlusion becomes denser, the target light information contained in the image frames decreases sharply and the interfering light from the occluder dominates, so the performance of synthetic aperture imaging algorithms based on traditional optical cameras degrades severely. In recent years, synthetic aperture imaging algorithms based on event cameras have been proposed to solve the see-through imaging problem in densely occluded scenes. An event camera senses per-pixel brightness changes in the logarithmic domain and asynchronously outputs event-stream data with high temporal resolution and high dynamic range, so it can sense the target almost continuously and thus gather sufficient target information even in densely occluded scenes. However, existing event-camera synthetic aperture imaging algorithms image from event points generated by the brightness difference between the target and the occluder, so in sparsely occluded scenes the number of effective event points drops and the performance of event-camera-based synthetic aperture imaging degrades. Considering that synthetic aperture imaging algorithms based on traditional optical cameras and on event cameras perform better in sparsely and densely occluded scenes respectively, the information from both can be fully exploited for fused imaging, enabling synthetic aperture imaging under occlusions of various densities. However, because the data modality of the event stream is completely different from that of image-frame data, building a bridge between the two to realize fused imaging remains very difficult.
Disclosure of Invention
Aiming at the above problems, the invention provides a synthetic aperture imaging method based on an event camera and a traditional optical camera, which combines the advantages of the two cameras for synthetic aperture imaging: by constructing a neural network architecture based on a spiking neural network and a convolutional neural network, it builds a bridge between the event stream and the image frames and reconstructs high-quality, occlusion-free target image frames, thereby completing high-quality see-through imaging tasks in occluded scenes of various densities.
The invention provides a synthetic aperture imaging method based on an event camera and a traditional optical camera, which comprises the following specific steps:
step 1, constructing a multi-view event stream and an image frame data set in an occlusion scene, and constructing the image frame data set in a non-occlusion scene.
Step 2, selecting a reference camera position, mapping a multi-view event stream and an image frame to the reference camera position according to a multi-view geometric principle, and refocusing an occluded target to obtain a refocused event stream and an image frame data set;
Step 3, constructing a hybrid neural network model: the refocused event stream is compressed into event frames, the unfocused event stream is pre-reconstructed into pre-reconstructed event frames and then refocused, and these are input together with the refocused image frame data set into the hybrid neural network as the training set to obtain the network-predicted target reconstruction image; combining the occlusion-free image of the target with the network learning loss function, the network parameters are iteratively optimized with an ADAM optimizer to obtain the trained hybrid neural network;
the hybrid neural network model described in step 3 includes the following modules: a multi-mode coding module, a cross-attention enhancement module, a density sensing module and a multi-mode decoding module;
the multi-mode coding module comprises a plurality of convolutional layers or spiking layers and is used for extracting features;
the cross-attention enhancement module comprises a plurality of cross-attention Transformer modules for enhancing the extracted features multiple times;
the density sensing module first concatenates the enhanced features with the original features, then applies a convolutional layer, global average pooling, global maximum pooling and a fully connected layer, and finally performs a weighted summation to output the fused features;
the multi-mode decoding module is an existing convolutional neural network architecture and is used for reconstructing images from the fused features;
Step 4, inputting the event stream and image frames of the occluded target to be reconstructed into the trained network for prediction to obtain the de-occluded reconstructed image corresponding to the occluded target.
Further, the multi-view event stream data set E in step 1 is:
E = {e_k = (x_k, y_k, p_k, t_k)}, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, -1}
where e_k is the k-th event point, x_k, y_k are its pixel coordinates, p_k is the event polarity (polarity 1 indicates a brightness increase and polarity -1 a brightness decrease), t_k is the timestamp of the event point, W and H are the width and height of the event-point spatial coordinates, and K is the number of event points.
Further, the multi-view image frame data set F in step 1 is:
F = {I_n}, n ∈ [1, N], I_n ∈ R^(W×H)
where I_n denotes the n-th image frame, W and H denote the width and height of an image frame, and N denotes the total number of image frames.
Further, the mapping of the multi-view event stream and image frames to the reference camera position in step 2 is specifically:
[x_r, y_r, 1] = K R K^(-1) [x, y, 1] + K T / d    (1)
In formula (1), x_r, y_r represent the mapped image coordinates, x, y represent the original image coordinates, R and T represent the rotation and translation matrices from the current camera position to the reference camera position, K represents the intrinsic matrix of the camera, and d is the refocusing distance, generally set to the distance from the occluded target to the camera plane. According to mapping formula (1), the refocused event stream data set E_r can be obtained:
E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]
where e_k^r is the k-th refocused event point, x_k^r, y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (polarity 1 indicates a brightness increase and polarity -1 a brightness decrease), and t_k^r is the timestamp of the event point. From mapping formula (1), the refocused image frame data set F_r can likewise be obtained:
F_r = {I_n^r}, n ∈ [1, N]
where I_n^r denotes the n-th refocused image frame.
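As an illustration of mapping formula (1), the following is a minimal NumPy sketch of refocusing an event stream to the reference camera position; the array layout and the names events, K, R, T, d are assumptions for this example rather than definitions taken from the patent.

```python
# Minimal sketch of the refocusing warp of formula (1), assuming an (N, 4) event array.
import numpy as np

def refocus_events(events, K, R, T, d):
    """Map event pixel coordinates (x, y) to the reference camera view.

    events: (N, 4) array of [x, y, p, t]
    K: (3, 3) camera intrinsic matrix
    R, T: rotation (3, 3) and translation (3,) from this view to the reference view
    d: refocusing distance (depth of the occluded target)
    """
    xy1 = np.column_stack([events[:, 0], events[:, 1], np.ones(len(events))])  # homogeneous coords
    warped = (K @ R @ np.linalg.inv(K) @ xy1.T).T + (K @ T) / d                # formula (1)
    warped = warped / warped[:, 2:3]   # normalize the homogeneous coordinate (a no-op when R = I and T has no z part)
    refocused = events.copy()
    refocused[:, 0:2] = warped[:, 0:2]
    return refocused
```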
Further, the refocused event stream frame-compression process described in step 3 is expressed as:
F_j^{E,r}(x, y) = Σ_{t_k^r ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k^r · δ(x - x_k^r, y - y_k^r), j ∈ [1, J]    (2)
In formula (2), F_j^{E,r} denotes the j-th compressed event frame, J is the total number of event frames, x and y denote image coordinates, x_k^r, y_k^r are the image coordinates of the k-th refocused event point, δ(·) denotes the Dirac delta function, and Δt denotes the time length covered by a single compressed event frame, calculated as:
Δt = (t_K - t_1) / J
where t_K is the timestamp of the last (K-th) event point and t_1 is the timestamp of the first event point. Thus, the compressed refocused event frame data set F_{E,r} can be obtained:
F_{E,r} = {F_j^{E,r}}, j ∈ [1, J]
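A minimal sketch of the frame-compression step of formula (2), assuming the refocused events are stored as an (N, 4) array of [x, y, p, t]; the stream is split into J equal time bins and polarities are summed per pixel, following the reconstruction of formula (2) above.

```python
# Event-frame compression of formula (2): J equal time bins, per-pixel polarity accumulation.
import numpy as np

def compress_event_frames(refocused_events, H, W, J):
    """refocused_events: (N, 4) array of [x, y, p, t] after refocusing."""
    x = np.clip(np.round(refocused_events[:, 0]).astype(int), 0, W - 1)
    y = np.clip(np.round(refocused_events[:, 1]).astype(int), 0, H - 1)
    p = refocused_events[:, 2]
    t = refocused_events[:, 3]
    dt = (t.max() - t.min()) / J                         # time span of one event frame
    j = np.minimum(((t - t.min()) / dt).astype(int), J - 1)
    frames = np.zeros((J, H, W), dtype=np.float32)
    np.add.at(frames, (j, y, x), p)                      # accumulate polarities per pixel
    return frames
```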
Further, the event stream pre-reconstruction process described in step 3 is expressed as:
I_j^{E→F}(x, y) = Recon( Σ_{t_k ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k · δ(x - x_k, y - y_k) ), j ∈ [1, J]    (3)
In formula (3), I_j^{E→F} denotes the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, x and y denote image coordinates, x_k, y_k are the image coordinates of the k-th event point, δ(·) denotes the Dirac delta function, Recon(·) denotes an event-stream brightness reconstruction operator (a mainstream event-stream reconstruction algorithm is generally used), and Δt is the time length covered by a single compressed event frame. Next, using mapping formula (1), each image I_j^{E→F} is mapped to the reference camera position, yielding the refocused pre-reconstructed event frame data set F_{E→F,r}:
F_{E→F,r} = {I_j^{E→F,r}}, j ∈ [1, J]
where I_j^{E→F,r} is the j-th refocused pre-reconstructed event frame.
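A hedged sketch of the pre-reconstruction pipeline of formula (3). The reconstruction operator Recon and the per-frame refocusing warp are passed in as callables, since the patent only requires a mainstream event-stream reconstruction algorithm (the embodiment names E2VID) and the warp of formula (1); the binning helper reuses the sketch given above.

```python
# Pre-reconstruction of formula (3): bin the raw (unfocused) events, reconstruct an
# intensity image per bin, then refocus each reconstructed image to the reference view.
def pre_reconstruct(raw_events, H, W, J, recon, refocus_frame):
    """raw_events: (N, 4) [x, y, p, t] before refocusing.
    recon: callable mapping an event representation to an intensity image (e.g. E2VID).
    refocus_frame: callable warping one frame to the reference view via formula (1).
    """
    binned = compress_event_frames(raw_events, H, W, J)   # reuse the binning sketch above
    pre_frames = [recon(frame) for frame in binned]       # event-to-intensity reconstruction
    return [refocus_frame(img) for img in pre_frames]     # refocus to the reference camera
```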
Further, the hybrid neural network model in step 3 includes the following modules: a multi-mode coding module, a cross-attention enhancement module, a density sensing module and a multi-mode decoding module. The compressed refocused event frames F_{E,r}, the refocused image frames F_r and the refocused pre-reconstructed event frames F_{E→F,r} input to the network are first fed into the multi-mode coding module along three branches for coarse feature extraction:
fF,0,fE,0,fE→F,0=MF-Encoder(Fr,FE,r,FE→F,r)
where f_{F,0}, f_{E,0}, f_{E→F,0} denote the coarse features extracted from F_r, F_{E,r}, F_{E→F,r} respectively, and MF-Encoder(·) denotes the multi-mode encoding operation, which extracts features from the three signals separately. For the two signals F_r and F_{E→F,r}, feature extraction uses a three-layer convolutional architecture with skip connections. For the F_{E,r} signal, feature extraction uses three spiking layers with skip connections.
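The following PyTorch sketch illustrates one possible multi-mode encoder of this kind: a convolutional branch with skip connections (used for F_r and F_{E→F,r}) and a simplified spiking branch for F_{E,r}. The layer sizes follow the embodiment (kernel sizes 3/5/7 and 1/3/7, channel counts 8/16/32), but the leaky-integrate-and-fire dynamics and the exact skip-connection placement are assumptions, not the patent's formulation.

```python
# Hedged sketch of the multi-modal encoder branches; not the patent's exact architecture.
import torch
import torch.nn as nn

class ConvBranch(nn.Module):
    """Three conv layers with skip connections, for F_r and F_{E->F,r}."""
    def __init__(self, in_ch):
        super().__init__()
        chans, kernels = [8, 16, 32], [3, 5, 7]
        layers, skips = [], []
        c = in_ch
        for ch, k in zip(chans, kernels):
            layers.append(nn.Sequential(nn.Conv2d(c, ch, k, padding=k // 2), nn.ReLU()))
            skips.append(nn.Conv2d(c, ch, 1))            # 1x1 projection for the skip path
            c = ch
        self.layers, self.skips = nn.ModuleList(layers), nn.ModuleList(skips)

    def forward(self, x):
        for layer, skip in zip(self.layers, self.skips):
            x = layer(x) + skip(x)                        # skip connection
        return x

class LIFSpikingBranch(nn.Module):
    """Processes the J compressed event frames as a time sequence with simple LIF neurons."""
    def __init__(self, decay=0.8, threshold=1.0):
        super().__init__()
        chans, kernels = [8, 16, 32], [1, 3, 7]
        self.convs = nn.ModuleList(
            nn.Conv2d(c_in, c_out, k, padding=k // 2)
            for c_in, c_out, k in zip([1] + chans[:-1], chans, kernels))
        self.decay, self.threshold = decay, threshold

    def forward(self, event_frames):                      # (B, J, H, W)
        B, J, H, W = event_frames.shape
        mems = [None] * len(self.convs)
        spikes = None
        for t in range(J):                                # iterate over event-frame time steps
            x = event_frames[:, t:t + 1]
            for i, conv in enumerate(self.convs):
                cur = conv(x)
                mems[i] = cur if mems[i] is None else self.decay * mems[i] + cur
                x = (mems[i] >= self.threshold).float()   # fire when the membrane crosses threshold
                mems[i] = mems[i] * (1 - x)               # reset fired neurons
            spikes = x if spikes is None else spikes + x
        return spikes / J                                 # average firing rate as the feature map
```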
Next, the features are enhanced using a cross-attention enhancement module:
fF,M,fE,M,fE→F,M=CME(fF,0,fE,0,fE→F,0)
where f_{F,0}, f_{E,0}, f_{E→F,0} are the three feature branches output by the multi-mode coding module in the previous step, f_{F,M}, f_{E,M}, f_{E→F,M} are the three feature branches after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which enhances the features multiple times through M cross-attention Transformer modules; this process is expressed as:
fF,m,fE,m,fE→F,m=C-Transformer(fF,m-1,fE,m-1,fE→F,m-1),m∈[1,M]
where f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1} are the three feature branches input to the m-th cross-attention Transformer module, f_{F,m}, f_{E,m}, f_{E→F,m} are the three output branches, and C-Transformer(·) denotes the feature enhancement operation of the cross-attention Transformer, applied to each branch: each branch is first normalized and its own attention information is computed, the attention information of the three branches is added and fused to achieve cross-modal information enhancement, a multi-layer perceptron then performs further feature enhancement, and the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m} are output.
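A hedged PyTorch sketch of one cross-attention Transformer block as just described: each branch is normalized, per-branch attention is computed, the three attention outputs are added to realize the cross-modal fusion, and a multi-layer perceptron refines each branch. The head count, token layout and residual placement are assumptions.

```python
# One cross-attention Transformer block (sketch); feature maps assumed flattened to tokens.
import torch
import torch.nn as nn

class CrossAttentionBlock(nn.Module):
    def __init__(self, dim, heads=4):
        super().__init__()
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))
        self.attns = nn.ModuleList(nn.MultiheadAttention(dim, heads, batch_first=True)
                                   for _ in range(3))
        self.mlps = nn.ModuleList(nn.Sequential(nn.LayerNorm(dim),
                                                nn.Linear(dim, 4 * dim), nn.GELU(),
                                                nn.Linear(4 * dim, dim)) for _ in range(3))

    def forward(self, f_F, f_E, f_EF):                    # each: (B, N_tokens, dim)
        feats = [f_F, f_E, f_EF]
        attn_outs = []
        for f, norm, attn in zip(feats, self.norms, self.attns):
            x = norm(f)                                   # normalize each branch
            attn_outs.append(attn(x, x, x)[0])            # per-branch attention information
        fused = sum(attn_outs)                            # add and fuse the three branches
        out = []
        for f, mlp in zip(feats, self.mlps):
            y = f + fused                                 # cross-modal enhancement (residual)
            out.append(y + mlp(y))                        # MLP feature enhancement
        return out
```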
Next, the density sensing module performs weighted fusion of the three feature branches:
fALL=DAF(fF,M,fE,M,fE→F,M,Fr,FE,r,FE→F,r)
where f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced feature branches output in the previous step, F_r, F_{E,r}, F_{E→F,r} are the three original input signals of the network, f_ALL is the fused output feature, and DAF(·) is the density-aware fusion operation: each input feature branch is concatenated with its corresponding original input along the feature dimension and passed through a single convolutional layer for feature extraction; global average pooling and global maximum pooling then extract the mean and maximum of the three branches, which are fed into a fully connected layer to compute the weights; finally the three branches are weighted and summed, and the fused feature f_ALL is output.
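A hedged PyTorch sketch of the density-aware fusion operation DAF(·): concatenation with the original input, a single convolution, global average and maximum pooling, a fully connected layer that scores each branch, and a weighted sum. The softmax normalization of the branch weights is an assumption; the patent only states that a fully connected layer computes the weights.

```python
# Density-aware fusion sketch; channel counts are illustrative placeholders.
import torch
import torch.nn as nn

class DensityAwareFusion(nn.Module):
    def __init__(self, feat_ch=32, raw_ch=1):
        super().__init__()
        self.convs = nn.ModuleList(nn.Conv2d(feat_ch + raw_ch, feat_ch, 3, padding=1)
                                   for _ in range(3))
        self.fc = nn.Linear(2 * feat_ch, 1)               # maps pooled stats to a branch score

    def forward(self, feats, raws):                       # lists of 3 tensors: (B,C,H,W) / (B,raw_ch,H,W)
        branch_feats, scores = [], []
        for conv, f, r in zip(self.convs, feats, raws):
            x = conv(torch.cat([f, r], dim=1))            # cascade feature with original input
            gap = x.mean(dim=(2, 3))                      # global average pooling
            gmp = x.amax(dim=(2, 3))                      # global maximum pooling
            scores.append(self.fc(torch.cat([gap, gmp], dim=1)))
            branch_feats.append(x)
        w = torch.softmax(torch.cat(scores, dim=1), dim=1)  # (B, 3) fusion weights
        return sum(w[:, i, None, None, None] * branch_feats[i] for i in range(3))
```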
Finally, decoding the features by using a multi-mode decoding module, and outputting a final brightness reconstruction image:
Irecon=MF-Decode(fALL)
where f_ALL is the fused feature output by the previous module, I_recon is the reconstructed luminance image, and MF-Decoder(·) is the multi-mode feature decoding operation, for which a mainstream convolutional neural network architecture is generally used.
Further, the network learning loss function L in step 3 is defined as:
L = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)    (4)
In formula (4), I_recon is the luminance image reconstructed by the network, I_gt is the ground-truth image corresponding to the target in the dataset, L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss; these are all commonly used known loss functions, and β_per, β_L1, β_tv denote the weights of the corresponding losses.
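A minimal PyTorch sketch of the loss in formula (4). The VGG-feature distance used for the perceptual term is one common realization and an assumption here; the weights shown are the ones given in the embodiment (β_per = 1, β_L1 = 32, β_tv = 0.0002).

```python
# Training loss of formula (4); vgg_features is an injected pretrained feature extractor.
import torch
import torch.nn.functional as F

def total_variation(img):
    return (img[..., :, 1:] - img[..., :, :-1]).abs().mean() + \
           (img[..., 1:, :] - img[..., :-1, :]).abs().mean()

def sai_loss(i_recon, i_gt, vgg_features, beta_per=1.0, beta_l1=32.0, beta_tv=2e-4):
    """vgg_features: callable returning a feature map from a pretrained network."""
    l_per = F.l1_loss(vgg_features(i_recon), vgg_features(i_gt))   # perceptual loss
    l_l1 = F.l1_loss(i_recon, i_gt)                                # L1 reconstruction loss
    l_tv = total_variation(i_recon)                                # total variation regularizer
    return beta_per * l_per + beta_l1 * l_l1 + beta_tv * l_tv
```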
Furthermore, the input event stream and image frame data in step 4 must first undergo the same refocusing process as in step 2 and then the same event stream preprocessing as in step 3, after which they are fed into the trained neural network to obtain the corresponding target reconstruction image.
Compared with the prior art, the invention has the advantages that:
The invention provides a synthetic aperture imaging method that fuses an event camera with a traditional optical camera. It comprehensively exploits the advantages of both cameras together with the strong learning ability of neural networks, thereby achieving high-quality image reconstruction of targets in occluded scenes of various densities and further enhancing the robustness and applicability of synthetic aperture imaging.
Drawings
FIG. 1 is a schematic diagram of the experimental scene, including the occluded target, the occluder, a programmable slide rail, and a camera sensor.
FIG. 2 is a flow chart of the overall process of the present invention.
Fig. 3 is a schematic diagram of a camera moving data acquisition process.
Fig. 4 is a contrast diagram of an image frame, an event frame and a pre-reconstruction event frame after data preprocessing.
Fig. 5 is a schematic diagram of a hybrid neural network structure.
FIG. 6 is a comparison of the present method with different synthetic aperture imaging methods. Each row, from top to bottom, shows synthetic aperture imaging results for a different densely occluded scene. From left to right: the first column is the occluded view, the second column is the occlusion-free reference image, the third column is a synthetic aperture imaging algorithm based on a traditional optical camera (F-SAI+ACC), the fourth column is a synthetic aperture imaging algorithm based on a traditional optical camera and a convolutional neural network (F-SAI+CNN), the fifth column is a synthetic aperture imaging algorithm based on an event camera and the accumulation method (E-SAI+ACC), the sixth column is a synthetic aperture imaging algorithm based on an event camera and a hybrid neural network (E-SAI+Hybrid), and the seventh column is the algorithm of the invention.
Detailed Description
In order to understand the present invention more clearly, the technical content of the invention is described more clearly and completely below with reference to an example and FIG. 1. The described example is obviously only one of the embodiments of the invention rather than all of them; all other examples obtained by a person of ordinary skill in the art based on the examples in this invention without inventive effort fall within the scope of the invention.
The specific embodiment of the invention is a synthetic aperture imaging method that fuses an event camera with a traditional optical camera.
A schematic scene of a specific implementation of the invention is shown in FIG. 1 of the accompanying drawings and comprises: a camera sensor, a programmable slide rail, an occluder, and an occluded target;
The camera sensor model is a Davis346 event camera, and the camera can synchronously output event stream and image frame data and is used for constructing a corresponding data set;
the Davis346 event camera is fixed on a programmable slide rail which is set to move linearly at a constant speed;
the whole flow chart of the invention is shown in figure 2 of the accompanying drawings, and the specific steps are as follows:
step 1, constructing a multi-view event stream and an image frame data set in an occlusion scene, and constructing an image frame data set in a non-occlusion scene, as shown in fig. 3;
The multi-view event stream data E described in step 1 is:
E = {e_k = (x_k, y_k, p_k, t_k)}, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, -1}
where e_k is the k-th event point, x_k, y_k are its pixel coordinates, p_k is the event polarity (polarity 1 indicates a brightness increase and polarity -1 a brightness decrease), t_k is the timestamp of the event point, W = 346 and H = 260 are the width and height of the event-point spatial coordinates, and K is the number of event points.
The multi-view image frame data set F in step 1 is:
F = {I_n}, n ∈ [1, N], I_n ∈ R^(W×H)
where I_n is the n-th image frame, W = 346 and H = 260 are the width and height of an image frame, and N = 30 is the total number of image frames.
Step 2, selecting a reference camera position, mapping a multi-view event stream and an image frame to the reference camera position according to a multi-view geometric principle, and refocusing an occluded target to obtain a refocused event stream and an image frame data set;
the method for mapping the multi-view event stream image frame to the reference camera position in the step 2 is specifically as follows:
[x_r, y_r, 1] = K R K^(-1) [x, y, 1] + K T / d    (1)
In formula (1), x_r, y_r represent the mapped image coordinates, R and T represent the rotation and translation matrices from the current camera position to the reference camera position, K represents the intrinsic matrix of the camera, and d is the refocusing distance, generally set to the distance from the occluded target to the camera plane. In this embodiment, since the camera is set to move in a uniform straight line, the camera can be assumed not to rotate during capture, so the rotation matrix reduces to the identity matrix. At time t, the translation matrix can be modeled as
T_t = [v_track (t - t_r), 0, 0]
where v_track = 0.0885 m/s is the moving speed of the slide rail, and t_r is the timestamp at which the camera passes the reference position during each capture.
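A small worked example of this uniform-motion case: with no rotation, formula (1) reduces to the pixel shift K·T_t/d added to [x, y, 1]. The intrinsic matrix and refocusing depth below are illustrative values, not parameters given in the patent.

```python
# Pixel shift induced by the rail translation for the uniform-linear-motion setup.
import numpy as np

v_track, t_r = 0.0885, 0.0          # rail speed (m/s) and reference-position timestamp (s)
fx, d = 500.0, 1.5                  # assumed focal length (px) and refocusing depth (m)
K = np.array([[fx, 0, 173], [0, fx, 130], [0, 0, 1]], dtype=float)

def pixel_shift(t):
    T_t = np.array([v_track * (t - t_r), 0.0, 0.0])   # translation at time t
    return (K @ T_t) / d                              # shift added to [x, y, 1] in formula (1)

print(pixel_shift(0.5))             # x-shift = fx * v_track * 0.5 / d = 14.75 px for these values
```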
From mapping formula (1), the refocused event stream data E_r can be obtained:
E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]
where e_k^r is the k-th refocused event point, x_k^r, y_k^r are its x- and y-axis coordinates, p_k^r is the event polarity (polarity 1 indicates a brightness increase and polarity -1 a brightness decrease), and t_k^r is the timestamp of the event point. From mapping formula (1), the refocused image frame data set F_r can also be obtained:
F_r = {I_n^r}, n ∈ [1, N]
where I_n^r denotes the n-th refocused image frame.
Step 3, constructing a hybrid neural network model: the refocused event stream is compressed into event frames, the unfocused event stream is pre-reconstructed into pre-reconstructed event frames and then refocused, and these are input together with the refocused image frame data set into the hybrid neural network as the training set to obtain the network-predicted target reconstruction image; combining the occlusion-free image of the target with the network learning loss function, the network parameters are iteratively optimized with an ADAM optimizer to obtain the trained hybrid neural network;
the refocusing event stream framing procedure described in step 3 is expressed as:
In the formula (2), The data of the event frames after the J-th pressed frame is represented, J is the total number of the event frames, x and y represent image coordinates, and the data of the event frames are represented as' xFor the image coordinates of the kth refocusing event point, δ (·) is represented as a dirichlet function, Δt is represented as the length of time taken for the single event frame to be compressed, and its calculation mode is represented as:
Where t k is the timestamp of the kth event point and t 1 is the timestamp of the first event point. Thus, a refocusing event frame data set F E,r after the frame compression can be obtained:
The process of pre-reconstructing and refocusing the event stream described in step 3 is expressed as:
I_j^{E→F}(x, y) = Recon( Σ_{t_k ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k · δ(x - x_k, y - y_k) ), j ∈ [1, J]    (3)
In formula (3), I_j^{E→F} denotes the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, x and y denote image coordinates, x_k, y_k are the image coordinates of the k-th event point, δ(·) denotes the Dirac delta function, Recon(·) denotes an event-stream brightness reconstruction operator (the existing E2VID algorithm is used in this embodiment), and Δt is the time length covered by a single compressed event frame. Next, using mapping formula (1), each image I_j^{E→F} is mapped to the reference camera position, yielding the refocused pre-reconstructed event frame data set F_{E→F,r}:
F_{E→F,r} = {I_j^{E→F,r}}, j ∈ [1, J]
where I_j^{E→F,r} is the j-th refocused pre-reconstructed event image.
The refocused image frames, event frames, and pre-reconstructed event frames are each visually illustrated in fig. 4 of the drawings.
The structure of the hybrid neural network in step 3 is shown in FIG. 5 of the accompanying drawings. The hybrid neural network model described in step 3 includes the following modules: a multi-mode coding module, a cross-attention enhancement module, a density sensing module and a multi-mode decoding module. The compressed refocused event frames F_{E,r}, the refocused image frames F_r and the refocused pre-reconstructed event frames F_{E→F,r} input to the network are fed into the multi-mode coding module along three branches for coarse feature extraction:
fF,0,fE,0,fE→F,0=MF-Encoder(Fr,FE,r,FE→F,r)
where f_{F,0}, f_{E,0}, f_{E→F,0} denote the coarse features extracted from F_r, F_{E,r}, F_{E→F,r} respectively, and MF-Encoder(·) denotes the multi-mode encoding operation, which extracts features from the three signals separately. For the two signals F_r and F_{E→F,r}, feature extraction uses a three-layer convolutional architecture with skip connections; in this embodiment, the kernel sizes of the three convolutional layers are 3, 5 and 7, and the numbers of kernels are 8, 16 and 32. For the F_{E,r} signal, feature extraction uses three spiking layers with skip connections; in this embodiment, the kernel sizes of the three spiking layers are 1, 3 and 7, and the numbers of kernels are 8, 16 and 32.
Next, the features are enhanced using a cross-attention enhancement module:
fF,M,fE,M,fE→F,M=CME(fF,0,fE,0,fE→F,0)
where f_{F,0}, f_{E,0}, f_{E→F,0} are the three feature branches output by the multi-mode coding module in the previous step, f_{F,M}, f_{E,M}, f_{E→F,M} are the three feature branches after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which enhances the features multiple times through M = 6 cross-attention Transformer modules:
fF,m,fE,m,fE→F,m=C-Transformer(fF,m-1,fE,m-1,fE→F,m-1),m∈[1,M]
where f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1} are the three feature branches input to the m-th cross-attention Transformer module, f_{F,m}, f_{E,m}, f_{E→F,m} are the three output branches, and C-Transformer(·) denotes the feature enhancement operation of the cross-attention Transformer: each branch is first normalized and its own attention information is computed, the attention information of the three branches is added and fused to achieve cross-modal information enhancement, a multi-layer perceptron then performs further feature enhancement, and the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m} are output.
Next, the density sensing module performs weighted fusion of the three feature branches:
fALL=DAF(fF,M,fE,M,fE→F,M,Fr,FE,r,FE→F,r)
where f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced feature branches output in the previous step, F_r, F_{E,r}, F_{E→F,r} are the three original input signals of the network, f_ALL is the fused output feature, and DAF(·) is the density-aware fusion operation: each input feature branch is concatenated with its corresponding original input along the feature dimension and passed through a single convolutional layer for feature extraction; in this embodiment the kernel size is 3 and the number of kernels is 32. Global average pooling and global maximum pooling then extract the mean and maximum of the three branches, which are fed into a fully connected layer to compute the weights; finally the three branches are weighted and summed, and the fused feature f_ALL is output.
Finally, decoding the features by using a multi-mode decoding module, and outputting a final brightness reconstruction image:
Irecon=MF-Decode(fALL)
where f_ALL is the fused feature output by the previous module, I_recon is the reconstructed luminance image, and MF-Decoder(·) is the multi-mode feature decoding operation; in this embodiment, the existing ResNet architecture is used as the multi-mode feature decoder.
The learning loss function of the network in step 3 is defined as:
L = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)    (4)
In formula (4), I_recon is the luminance image reconstructed by the network, I_gt is the ground-truth image corresponding to the target in the dataset, L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss; these are all commonly used known loss functions, and β_per, β_L1, β_tv denote the weights of the corresponding losses. In this embodiment, β_per = 1, β_L1 = 32, and β_tv = 0.0002.
And 4, inputting the event stream of the blocked target to be reconstructed and the image frame into the trained network for prediction to obtain a de-blocked reconstructed image corresponding to the blocked target.
The input event stream and image frame data in step 4 must first undergo the same refocusing process as in step 2 and then the same event stream preprocessing as in step 3, after which they are fed into the trained neural network to obtain the corresponding target reconstruction image.
FIG. 6 compares the results of the algorithm of the invention with other synthetic aperture imaging algorithms in a variety of occluded scenes. Each row, from top to bottom, shows synthetic aperture imaging results for a different densely occluded scene. From left to right: the first column is the occluded view, the second column is the occlusion-free reference image, the third column is a synthetic aperture imaging algorithm based on a traditional optical camera (F-SAI+ACC), the fourth column is a synthetic aperture imaging algorithm based on a traditional optical camera and a convolutional neural network (F-SAI+CNN), the fifth column is a synthetic aperture imaging algorithm based on an event camera and the accumulation method (E-SAI+ACC), the sixth column is a synthetic aperture imaging algorithm based on an event camera and a hybrid neural network (E-SAI+Hybrid), and the seventh column is the algorithm of the invention.
It should be understood that technical portions that are not specifically set forth in the present specification are all prior art.
It should be understood that the foregoing description of the preferred embodiments is not to be construed as limiting the scope of the invention, but that substitutions and modifications can be made by one of ordinary skill in the art without departing from the scope of the invention as defined by the appended claims.

Claims (9)

1. A synthetic aperture imaging method for fusing an event camera with a conventional optical camera, comprising the steps of:
Step 1, constructing a multi-view event stream and an image frame data set in an occlusion scene, and constructing an image frame data set in a non-occlusion scene;
Step 2, selecting a reference camera position, mapping a multi-view event stream and an image frame to the reference camera position according to a multi-view geometric principle, and refocusing an occluded target to obtain a refocused event stream and an image frame data set;
Step 3, constructing a hybrid neural network model: the refocused event stream is compressed into event frames, the unfocused event stream is pre-reconstructed into pre-reconstructed event frames and then refocused, and these are input together with the refocused image frame data set into the hybrid neural network model as the training set to obtain the network-predicted target reconstruction image; combining the occlusion-free image of the target with the network learning loss function, the network parameters are iteratively optimized with an ADAM optimizer to obtain the trained hybrid neural network model;
the hybrid neural network model described in step 3 includes the following modules: a multi-mode coding module, a cross-attention enhancement module, a density sensing module and a multi-mode decoding module;
the multi-mode coding module comprises a plurality of convolutional layers or spiking layers and is used for extracting features;
the cross-attention enhancement module comprises a plurality of cross-attention Transformer modules for enhancing the extracted features multiple times;
the density sensing module first concatenates the enhanced features with the original features, then applies a convolutional layer, global average pooling, global maximum pooling and a fully connected layer, and finally performs a weighted summation to output the fused features;
the multi-mode decoding module is an existing convolutional neural network architecture and is used for reconstructing images from the fused features;
Step 4, inputting the event stream and image frames of the occluded target to be reconstructed into the trained hybrid neural network model for prediction to obtain the de-occluded reconstructed image corresponding to the occluded target;
the compressed refocused event frames F_{E,r}, the refocused image frames F_r and the refocused pre-reconstructed event frames F_{E→F,r} input to the hybrid neural network model are fed into the multi-mode coding module along three branches for coarse feature extraction:
fF,0,fE,0,fE→F,0=MF-Encoder(Fr,FE,r,FE→F,r)
wherein f_{F,0}, f_{E,0}, f_{E→F,0} represent the coarse features extracted from F_r, F_{E,r}, F_{E→F,r}, and MF-Encoder(·) represents the multi-mode encoding operation, which performs feature extraction on the three signals separately; for the two signals F_r and F_{E→F,r}, a three-layer convolutional architecture with skip connections is used for feature extraction; for the F_{E,r} signal, three spiking layers with skip connections are used for feature extraction;
next, the features are enhanced using a cross-attention enhancement module:
fF,M,fE,M,fE→F,M=CME(fF,0,fE,0,fE→F,0)
wherein f_{F,0}, f_{E,0}, f_{E→F,0} are the three feature branches output by the multi-mode coding module, f_{F,M}, f_{E,M}, f_{E→F,M} are the three feature branches after cross-attention enhancement, and CME(·) is the cross-modal enhancement operation, which enhances the features multiple times through M cross-attention Transformer modules; this process is expressed as:
fF,m,fE,m,fE→F,m=C-Transformer(fF,m-1,fE,m-1,fE→F,m-1),m∈[1,M]
wherein f_{F,m-1}, f_{E,m-1}, f_{E→F,m-1} are the three feature branches input to the m-th cross-attention Transformer module, f_{F,m}, f_{E,m}, f_{E→F,m} are the three output branches, and C-Transformer(·) represents the feature enhancement operation of the cross-attention Transformer: each branch is first normalized and its own attention information is computed, the attention information of the three branches is added and fused to achieve cross-modal information enhancement, a multi-layer perceptron then performs further feature enhancement, and the enhanced features f_{F,m}, f_{E,m}, f_{E→F,m} are output;
next, the density sensing module performs weighted fusion of the three feature branches:
fALL=DAF(fF,M,fE,M,fE→F,M,Fr,FE,r,FE→F,r)
wherein f_{F,M}, f_{E,M}, f_{E→F,M} are the three enhanced feature branches output by the cross-attention enhancement module, F_r, F_{E,r}, F_{E→F,r} are the three original input signals, f_ALL is the fused output feature, and DAF(·) is the density-aware fusion operation: each input feature branch is concatenated with its corresponding original input along the feature dimension and passed through a single convolutional layer for feature extraction; global average pooling and global maximum pooling then extract the mean and maximum of the three branches, which are fed into a fully connected layer to compute the weights; finally the three branches are weighted and summed, and the fused feature f_ALL is output;
Finally, decoding the features by using a multi-mode decoding module, and outputting a final brightness reconstruction image:
Irecon=MF-Decode(fALL)
wherein f_ALL is the fused feature output by the density sensing module, I_recon is the reconstructed luminance image, and MF-Decoder(·) is the multi-mode feature decoding operation.
2. A synthetic aperture imaging method of a fused event camera and traditional optical camera as defined in claim 1, wherein: the multi-view event stream data set E described in step 1 is represented as:
E = {e_k = (x_k, y_k, p_k, t_k)}, k ∈ [1, K], x_k ∈ [1, W], y_k ∈ [1, H], p_k ∈ {1, -1}
wherein e_k is the k-th event point, x_k, y_k are its pixel coordinates, p_k is the event polarity, a polarity of 1 indicates a brightness increase, and a polarity of -1 indicates a brightness decrease; t_k is the timestamp of the event point; W and H respectively represent the width and height of the event-point spatial coordinates, and K represents the number of event points.
3. A synthetic aperture imaging method of a fused event camera and legacy optical camera as defined in claim 1, wherein: the multi-view image frame dataset F is represented as:
F = {I_n}, n ∈ [1, N], I_n ∈ R^(W×H)
Wherein I n denotes nth image frame data, W, H denote width and height of an image frame, respectively, and N denotes the total number of image frames.
4. A synthetic aperture imaging method of a fused event camera and legacy optical camera as defined in claim 1, wherein: the method for mapping the multi-view event stream image frame to the reference camera position in the step 2 is specifically as follows:
[x_r, y_r, 1] = K R K^(-1) [x, y, 1] + K T / d    (1)
in formula (1), x_r, y_r represent the mapped image coordinates, R, T represent the rotation and translation matrices from the current camera position to the reference camera position, and x, y represent the original image coordinates; K represents the intrinsic matrix of the camera, and d is the refocusing distance;
From the mapping formula (1), the refocused event stream dataset E r can be obtained:
E_r = {e_k^r = (x_k^r, y_k^r, p_k^r, t_k^r)}, k ∈ [1, K]
wherein e_k^r is the k-th refocused event point, x_k^r, y_k^r represent its x- and y-axis coordinates, p_k^r is the event polarity, a polarity of 1 indicates a brightness increase, and a polarity of -1 indicates a brightness decrease; t_k^r is the timestamp of the event point; according to mapping formula (1), the refocused image frame data set F_r can be obtained:
F_r = {I_n^r}, n ∈ [1, N]
wherein I_n^r denotes the n-th refocused image frame.
5. A synthetic aperture imaging method as defined in claim 4, wherein the synthetic aperture imaging method is used in combination with a conventional optical camera, wherein: the refocusing event stream framing procedure described in step 3 is expressed as:
F_j^{E,r}(x, y) = Σ_{t_k^r ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k^r · δ(x - x_k^r, y - y_k^r), j ∈ [1, J]    (2)
In formula (2), F_j^{E,r} denotes the j-th compressed event frame, J is the total number of event frames, x and y denote image coordinates, x_k^r, y_k^r are the image coordinates of the k-th refocused event point, δ(·) denotes the Dirac delta function, and Δt denotes the time length covered by a single compressed event frame, calculated as:
Δt = (t_K - t_1) / J
wherein t_K is the timestamp of the last (K-th) event point, and t_1 is the timestamp of the first event point; thus, the compressed refocused event frame data set can be obtained:
F_{E,r} = {F_j^{E,r}}, j ∈ [1, J]
6. A synthetic aperture imaging method as defined in claim 5, wherein the synthetic aperture imaging method is used in combination with a conventional optical camera, and wherein: the event stream pre-reconstruction and refocusing process in the step 3 is expressed as follows;
I_j^{E→F}(x, y) = Recon( Σ_{t_k ∈ [t_1 + (j-1)Δt, t_1 + jΔt]} p_k · δ(x - x_k, y - y_k) ), j ∈ [1, J]    (3)
In formula (3), I_j^{E→F} denotes the j-th pre-reconstructed event frame, J is the total number of pre-reconstructed event frames, x and y represent image coordinates, x_k, y_k are the image coordinates of the k-th event point, δ(·) is the Dirac delta function, Recon(·) represents an event-stream brightness reconstruction operator, and Δt represents the time length covered by a single compressed event frame; next, using mapping formula (1), each image I_j^{E→F} is mapped to the reference camera position, and the refocused pre-reconstructed event frame data set F_{E→F,r} is obtained:
F_{E→F,r} = {I_j^{E→F,r}}, j ∈ [1, J]
wherein I_j^{E→F,r} is the j-th refocused pre-reconstructed event frame.
7. A synthetic aperture imaging method of a fused event camera and conventional optical camera as defined in claim 1, wherein: the learning loss function of the network in the step 3 is defined as:
L = β_per · L_per(I_recon, I_gt) + β_L1 · L_L1(I_recon, I_gt) + β_tv · L_tv(I_recon)    (4)
I_recon in formula (4) is the luminance image reconstructed by the hybrid neural network model, I_gt is the ground-truth image corresponding to the target image in the dataset, L_per is the perceptual loss, L_L1 is the L1-norm loss, and L_tv is the total variation loss, and β_per, β_L1, β_tv denote the weights of the corresponding losses.
8. A synthetic aperture imaging method of a fused event camera and conventional optical camera as defined in claim 1, wherein: the kernel sizes of the three convolutional layers are 3, 5 and 7 respectively, and the numbers of convolution kernels are 8, 16 and 32; the kernel sizes of the three spiking layers are 1, 3 and 7, and the numbers of spiking kernels are 8, 16 and 32.
9. A synthetic aperture imaging method of a fused event camera and conventional optical camera as defined in claim 1, wherein: the Transformer module is an existing Swin-Transformer architecture, and the multi-mode feature decoder is an existing ResNet architecture.
CN202210422694.2A 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera Active CN114862732B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210422694.2A CN114862732B (en) 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210422694.2A CN114862732B (en) 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera

Publications (2)

Publication Number Publication Date
CN114862732A CN114862732A (en) 2022-08-05
CN114862732B true CN114862732B (en) 2024-04-26

Family

ID=82630677

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210422694.2A Active CN114862732B (en) 2022-04-21 2022-04-21 Synthetic aperture imaging method integrating event camera and traditional optical camera

Country Status (1)

Country Link
CN (1) CN114862732B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115578295B (en) * 2022-11-17 2023-04-07 中国科学技术大学 Video rain removing method, system, equipment and storage medium
CN116310408B (en) * 2022-11-29 2023-10-13 北京大学 Method and device for establishing data association between event camera and frame camera
CN115761472B (en) * 2023-01-09 2023-05-23 吉林大学 Underwater dim light scene reconstruction method based on fusion event and RGB data
CN117939309A (en) * 2024-03-25 2024-04-26 荣耀终端有限公司 Image demosaicing method, electronic device and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667442A (en) * 2020-05-21 2020-09-15 武汉大学 High-quality high-frame-rate image reconstruction method based on event camera
CN112987026A (en) * 2021-03-05 2021-06-18 武汉大学 Event field synthetic aperture imaging algorithm based on hybrid neural network

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11037278B2 (en) * 2019-01-23 2021-06-15 Inception Institute of Artificial Intelligence, Ltd. Systems and methods for transforming raw sensor data captured in low-light conditions to well-exposed images using neural network architectures
US11288818B2 (en) * 2019-02-19 2022-03-29 The Trustees Of The University Of Pennsylvania Methods, systems, and computer readable media for estimation of optical flow, depth, and egomotion using neural network trained using event-based learning

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111667442A (en) * 2020-05-21 2020-09-15 武汉大学 High-quality high-frame-rate image reconstruction method based on event camera
CN112987026A (en) * 2021-03-05 2021-06-18 武汉大学 Event field synthetic aperture imaging algorithm based on hybrid neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Event-based Synthetic Aperture Imaging with a Hybrid Network; Zhang Xiang et al.; IEEE; 2021-11-02; pp. 14235-14242 *
Synthetic aperture imaging method based on confocal illumination; Xiang Yiyi et al.; Acta Optica Sinica; 2020-04-30; Vol. 40, No. 08; pp. 73-79 *

Also Published As

Publication number Publication date
CN114862732A (en) 2022-08-05

Similar Documents

Publication Publication Date Title
CN114862732B (en) Synthetic aperture imaging method integrating event camera and traditional optical camera
CN110033003B (en) Image segmentation method and image processing device
CN111047548B (en) Attitude transformation data processing method and device, computer equipment and storage medium
CN109993707B (en) Image denoising method and device
JP6415781B2 (en) Method and system for fusing detected measurements
CN112446270A (en) Training method of pedestrian re-identification network, and pedestrian re-identification method and device
CN115761472B (en) Underwater dim light scene reconstruction method based on fusion event and RGB data
EP2979449B1 (en) Enhancing motion pictures with accurate motion information
CN113076685A (en) Training method of image reconstruction model, image reconstruction method and device thereof
TWI791405B (en) Method for depth estimation for variable focus camera, computer system and computer-readable storage medium
CN111951195A (en) Image enhancement method and device
CN113065645A (en) Twin attention network, image processing method and device
Yuan et al. A novel deep pixel restoration video prediction algorithm integrating attention mechanism
CN110335228B (en) Method, device and system for determining image parallax
CN112115786A (en) Monocular vision odometer method based on attention U-net
KR20220014678A (en) Method and apparatus for estimating depth of images
CN112446835A (en) Image recovery method, image recovery network training method, device and storage medium
CN116797640A (en) Depth and 3D key point estimation method for intelligent companion line inspection device
CN115731280A (en) Self-supervision monocular depth estimation method based on Swin-Transformer and CNN parallel network
CN112819742B (en) Event field synthetic aperture imaging method based on convolutional neural network
CN114820299A (en) Non-uniform motion blur super-resolution image restoration method and device
Sehli et al. WeLDCFNet: Convolutional Neural Network based on Wedgelet Filters and Learnt Deep Correlation Features for depth maps features extraction
Khamassi et al. Joint denoising of stereo images using 3D CNN
CN116939186B (en) Processing method and device for automatic associative covering parallax naked eye space calculation
CN117593702B (en) Remote monitoring method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant