CN117319806A - Dim light video enhancement method and device based on event camera assistance - Google Patents

Dim light video enhancement method and device based on event camera assistance

Info

Publication number
CN117319806A
Authority
CN
China
Prior art keywords
frame
event
feature
camera
optical flow
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311320452.3A
Other languages
Chinese (zh)
Inventor
施柏鑫
梁锦秀
许勇
杨溢鑫
李博宇
段沛奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Peking University
Original Assignee
Peking University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University
Priority to CN202311320452.3A
Publication of CN117319806A
Legal status: Pending

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/70Circuitry for compensating brightness variation in the scene
    • H04N23/741Circuitry for compensating brightness variation in the scene by increasing the dynamic range of the image compared to the dynamic range of the electronic image sensors
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/70Circuitry for compensating brightness variation in the scene
    • H04N23/73Circuitry for compensating brightness variation in the scene by influencing the exposure time
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/80Camera processing pipelines; Components thereof
    • H04N23/81Camera processing pipelines; Components thereof for suppressing or minimising disturbance in the image signal generation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02TCLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00Road transport of goods or passengers
    • Y02T10/10Internal combustion engine [ICE] based vehicles
    • Y02T10/40Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a dim light video enhancement method and device based on event camera assistance, relating to the technical field of video generation. By introducing an event camera with high temporal resolution, the invention assists the extraction of motion information and correlation between dim light video frames, and enhances the contrast and signal-to-noise ratio of the dim light video frames using the ideas of multi-frame alignment and fusion; the enhancement results are substantially better than those of existing dim light video enhancement methods. Unlike prior art that requires precise spatial alignment between the event camera and the frame camera, the method and device can, through the multi-modal correlation modeling module, cope with imprecise alignment between the event camera and the frame camera in real dim light high-speed scenes, and therefore have a wide range of application.

Description

Dim light video enhancement method and device based on event camera assistance
Technical Field
The invention relates to the technical field of video generation, in particular to a dim light video enhancement method and device based on event camera assistance.
Background
In recent years, rapidly iterating mobile phone cameras have made everyone a capable photographer, and the large-scale application of unmanned aerial vehicles in social and industrial scenes has also driven continuous improvement in imaging capability. However, photography in high-speed scenes with insufficient light remains an open problem in the industry. In photography, a very fast shutter speed is often used to capture fast motion while avoiding blur, but this also results in a very low signal-to-noise ratio. A slower shutter speed improves the signal-to-noise ratio but introduces motion blur. This shutter-speed trade-off is difficult to resolve with existing deblurring or multi-exposure fusion methods. In professional photography, such as sports video and film production, a common option is to place large fill lights in the scene, but their use in everyday photography is limited by portability and power consumption.
To computationally improve the quality of video captured in dim light high-speed scenes, video enhancement algorithms are often used to process the captured raw dim light video and thereby improve the quality of the degraded frames. The key to video enhancement is to exploit redundancy between video frames and fuse consistent information from the preceding and following frames to improve video quality. Its performance depends largely on the quality of the optical flow estimated from the video frames, and optical flow estimation rests on two basic assumptions: i) the displacement of the same scene point in space-time is very small; ii) the brightness of the same scene point remains unchanged during motion, i.e., brightness constancy. However, for dim light video frames with weak features and heavy noise contamination, both assumptions become very fragile and are difficult to use for establishing matching relationships of moving scene points. Although deep learning has in recent years improved the performance of many learning-based dim light video enhancement methods, improving video frame quality in fast motion scenes with large pixel displacements remains very challenging.
Prior art 1 (Seeing Dynamic Scene in the Dark: A High-Quality Video Dataset With Mechatronic Alignment) maps underexposed dim light video to normal illumination video with end-to-end deep learning; its network is designed based on Retinex theory and comprises three basic modules: frame alignment, noise suppression, and illumination enhancement. The algorithm flow of the implementation is shown in fig. 1. The method comprises the following steps:
(1) In the inter-frame alignment module f_a on the left, given five consecutive dim light video frames I_{t+i}, i ∈ [-2, 2], their features are first extracted by three convolution layers and two downsampling layers to propagate their mutual information in space-time; the information of adjacent frames is then progressively and implicitly warped to the intermediate frame by deformable convolution layers to obtain fused features.
(2) In the noise suppression module f_n at the lower right, the fused features are mapped into a noise map N_t, which is constrained by a self-supervised denoising loss function defined over the frames I_{t+i}, i ∈ [-2, 2], and their average RGB values.
(3) In the illumination enhancement module f_i at the upper right, the fused features are mapped into an illumination map S_t, which is constrained by a loss function defined with respect to the normal illumination video frame corresponding to the intermediate frame of the input dim light video.
(4) The final enhanced result is then computed from the estimated noise map and illumination map.
however, since the contrast ratio and the signal-to-noise ratio in the dark video frame are extremely low, estimating the corresponding normal illumination video frame directly from a single dark video frame is a very uncomfortable problem, and the prior art can bring a certain effect to improve the stability between the enhanced video frames by additionally adding the guidance of the optical flow during training, but cannot well overcome the discomfort of the problem itself.
Alignment and fusion across multiple dim light video frames can overcome the ill-posedness of the dim light video enhancement problem to a certain extent and improve the signal-to-noise ratio of the video frames. However, under dim light the feature responses of the video frames are weak due to heavy noise and low contrast, making it difficult to establish matching relationships between frames, which greatly degrades the quality of the results.
An event camera asynchronously records logarithmic brightness changes with high dynamic range, low latency, and low power consumption, providing a promising direction for photography in dim light high-speed scenes. Event cameras have microsecond-level temporal resolution, possess unique advantages for motion estimation, and can provide reliable inter-frame correlation in high-speed motion scenes, thereby assisting video enhancement. Event cameras have no concept of exposure time and therefore do not suffer from the dilemma between strong blur caused by long exposure times and low signal-to-noise ratio caused by short exposure times. In summary, using the high temporal resolution and high dynamic range information of events to guide dim light video enhancement holds great potential for obtaining high-quality video in dim light high-speed scenes.
Prior art 2 (Learning to See in the Dark with Events) converts the high-dynamic-range events captured by an event camera under dim light into standard sharp images. It uses an unsupervised domain adaptation method to circumvent the difficulty of collecting paired event-image training data, explicitly separates domain-invariant features (e.g., scene structure) from domain-specific features (e.g., detail textures) to simplify representation learning, and additionally uses a detail enhancement branch to reconstruct features specific to daytime illumination from the domain-invariant representation in a residual manner, regularized with a rank loss. The algorithm flow of the implementation is shown in fig. 2. The method comprises the following steps:
(1) Normal light events and dim light events are input to a shared encoder E_c to extract their representations; at the same time, the normal light events, together with a noise channel, are also input to a private encoder E_p to generate a source-domain-specific residual.
(2) Adversarial training with a discriminator D ensures that the normal illumination features modulated by an additive operation and the dim light features lie in a domain-shared feature space.
(3) A detail enhancement branch T_e reconstructs the normal-illumination domain-specific residual from the shared features.
(4) Finally, a shared decoder R reconstructs the intensity image from the domain-specific and shared representations. "R/F" denotes a real-or-fake logical value.
Prior art 2 obtains normally illuminated video frames by converting the high-dynamic-range event stream captured by the event camera into corresponding light intensity values. However, events asynchronously record brightness changes while frames synchronously record absolute brightness; direct conversion between the two is highly ill-posed, and the presence of noise further increases the difficulty.
Disclosure of Invention
Aiming at the problem in the prior art that matching relationships between frames are difficult to establish because feature responses are weak in dim light high-speed scenes, the invention provides a dim light video enhancement method and device based on event camera assistance. By establishing matching relationships for the same points in the scene, the method guides the complementary fusion of spatio-temporally consistent inter-frame information, thereby improving the signal-to-noise ratio of the video; and by establishing multi-scale all-pairs correlation between events and frames in the feature space, it compensates for the spatial and modal misalignment between events and frames caused by inter-modal differences and the difficulty of pixel-wise alignment under dim light conditions.
In order to achieve the above object, the present invention provides the following technical solutions:
the invention provides a dim light video enhancement method based on event camera assistance, which comprises the following steps:
S1, a hybrid camera system consisting of a frame camera and an event camera is used to acquire a sequence of consecutive dim light video frames and the corresponding event signal E_{t-N→t+N} triggered during this period, where N is the temporal radius; the spatial resolution h'×w' of the event camera is lower than the spatial resolution h×w of the frame camera;
S2, two modality feature encoders extract frame features and event features from the dim light video frame L_t and the event signal E_{t→t+1}, respectively;
S3, the frame features and event features are aligned by the multi-modal correlation module to obtain event features aligned with the frame feature space, and the optical flow S_{t→t+1} is jointly estimated;
S4, the current frame features, the next frame features, and the event features aligned with the current frame feature space are fused into noise-suppressed features by the temporal correlation propagation module under the guidance of the optical flow S_{t→t+1};
S5, the frame features and event features are input to the exposure estimation module to obtain an exposure parameter map P_t; the exposure parameter map P_t is applied to the noise-suppressed features, which are passed through the decoder to obtain a noise-reduced frame, and the final restored frame I_t is then obtained through inversion.
Further, the four-dimensional correlation volume extracted by the multi-modal correlation module in step S3 is established as follows:
first, unimodal optical flow estimators for frames and for events extract, from the dim light video frame L_t and the event signal E_{t→t+1} respectively, frame features and event features with different spatial resolutions;
then a four-dimensional correlation volume representing the all-pairs correlation between the dim light video frame L_t and the event signal E_{t→t+1} is obtained,
where p and q denote the pixel indices of features extracted from the frames and from the events, respectively, and the correlation score is calculated using an exponential function of the inner product of the transposed frame feature and the event feature.
Further, the multi-modal correlation module performs three poolings with an average pooling factor of 2 along the event spatial dimensions of the four-dimensional correlation volume, thereby obtaining correlation volumes at scale s.
Further, in step S3, the process by which the multi-modal correlation module uses the four-dimensional correlation volume to align the frame features and event features and obtain the event features aligned with the frame feature space comprises:
the event features are aligned through a projection matrix M_t to obtain the aligned features,
where the alignment applies a projective transformation defined by M_t; M_t is a 3×3 projection matrix obtained by parameterizing a 2×2×2 displacement cube D_t containing the displacement vectors of the four corner points of the image; the displacement cube D_t is initialized with the identity transformation and then iteratively updated by the global alignment module; the global alignment module operates as follows: first, the displacement cube D_t obtained in the previous iteration step is parameterized as an updated 3×3 projection matrix M_t; letting X denote the grid coordinates of the event features, X is mapped through the projection matrix M_t to the grid coordinate set X' of the current iteration step, and the corresponding displacement flow is F = X' - X; meanwhile, X' is used as an index to sample a correlation volume slice from the four-dimensional correlation volume downsampled to size (32×32)×(32×32); the correlation volume slice and the displacement flow F are then input together to the displacement cube decoding module to obtain the displacement cube D_t of the current iteration step.
Further, in step S3, the process by which the multi-modal correlation module uses the four-dimensional correlation volume to jointly estimate the optical flow S_{t→t+1} from the frame features and event features comprises:
using the feature extractors of the unimodal optical flow estimators, frame features and event features within the time interval [t, t+1] are estimated from the two consecutive frames L_t, L_{t+1} and the events E_{t→t+1}; the optical flow extracted from the frame features and the optical flow extracted from the event features are fused, with the event flow projected into the coordinates of the frame coordinate system; the four-dimensional correlation volume is then used to aggregate the two flows to obtain the estimated optical flow S_{t→t+1} between the two frames.
Further, a local correlation region of radius r centered at p', the position in sensor coordinates corresponding to each pixel p in the dim light video frame L_t, is used to jointly estimate the optical flow from the events E_{t→t+1} and the frames L_t, L_{t+1},
where r is the radius of the local region and δp is the displacement variable of pixel p ranging over an integer set; the local region is used as an index to sample correlation volume slices from the four-dimensional correlation volume, and these correlation scores are used as weights to aggregate the supplementary information from the events into the frames as a residual estimate,
where the event flow is projected into the frame space, Z is a normalization factor, and the result is refined by the unimodal frame optical flow estimator to obtain the jointly estimated optical flow S_{t→t+1} used in the subsequent temporal correlation propagation.
Further, in step S4, the process of fusing the noise-suppressed features by the temporal correlation propagation module comprises:
S41, a pixel p in the current dim light video frame L_t appears in frame L_{t+1} at a position displaced by the inter-frame motion, and the optical flow of the events is decomposed into two parts: one part has a constant velocity and is subsampled from the jointly estimated optical flow S_{t→t+1}; the other part has a time-varying velocity and is learned as a flow offset field;
S42, according to the jointly estimated optical flow S_{t→t+1}, the event features and the frame features are aligned to the intermediate frame according to time; the features of the next frame L_{t+1} are pre-registered to the features of the current frame L_t using the optical flow S_{t→t+1} jointly estimated within the time interval [t, t+1]; the motion between the B timestamps between the two current frames is estimated, and the event features are registered to the current frame features from the spatial slices of the constant-velocity part within the spatio-temporal neighborhood, where the τ-th slice contains the event features of the τ-th temporal bin and the registration is a correlation-aware variant of the warping operation, in which Z is used for normalization and the temporal consistency C_temp is estimated from consecutive frames;
S43, the frame features registered to the current frame, the registered event features, and S_{t→t+1} are concatenated along the channel dimension to predict the flow offsets Δ_{t→t+1} with a flow offset estimator module; the flow offsets Δ_{t→t+1} are used as residuals of the flow and together form the offset field of a deformable convolution layer, which implicitly integrates the events and frames into the latent frame and propagates temporal consistency at timestamp t for noise reduction, yielding the noise-suppressed features through the flow-guided inter-frame propagation module;
the noise-suppressed features are then decoded into a noise-reduced frame.
Further, in step S4, the exposure estimation module approximates the camera response function using a gamma curve with parameter 1/2.2 and uses an exposure parameter network to predict the exposure parameter map P_t.
Further, in step S4, the process of obtaining the exposure parameter map P_t by the exposure estimation module comprises:
S44, given the features of the events and frames, the exposure parameter network estimates a grid Γ_t in a low-dimensional bilateral space;
S45, guided by the input dim light frame L_t, an exposure parameter map P_t is extracted for each pixel from the grid Γ_t estimated in the low-dimensional bilateral space via a slicing operation; the exposure parameter map P_t has the same resolution as the input frame.
In another aspect, the present invention further provides an event camera-assisted dim light video enhancement device, comprising the following modules to implement any of the methods described above:
a hybrid camera system consisting of a frame camera and an event camera, used to acquire a sequence of consecutive dim light video frames and the corresponding event signal triggered during this period, wherein the spatial resolution of the event camera is lower than that of the frame camera, the frame camera and the event camera are connected through a beam splitter, and the beam splitter splits the incident light into two outgoing beams that enter the two cameras synchronously;
two modality feature encoders, used to extract frame features and event features from the dim light video frame L_t and the event signal E_{t→t+1}, respectively;
a multi-modal correlation module, used to align the frame features and event features to obtain event features aligned with the frame feature space and to jointly estimate the optical flow S_{t→t+1};
a temporal correlation propagation module, used to fuse the aligned frame and event features into noise-suppressed features under the guidance of the optical flow S_{t→t+1};
an exposure estimation module, used to perform exposure estimation on the frame features and event features to obtain an exposure parameter map P_t;
a decoder, used to apply the exposure parameter map P_t to the noise-suppressed features to obtain a noise-reduced frame, after which the final restored frame I_t is obtained through inversion.
Compared with the prior art, the invention has the beneficial effects that:
1. The event camera-assisted dim light video enhancement method of the invention estimates an appropriate exposure level from the high-dynamic-range event stream combined with the dim light video frames, thereby recovering video frames with high contrast.
2. The invention introduces the assistance of an event camera to help estimate inter-frame motion information for dim light video frames, solving the problem that inter-frame matching relationships are difficult to establish because feature responses are weak in dim light high-speed scenes; by establishing matching relationships for the same points in the scene, it guides the complementary fusion of spatio-temporally consistent inter-frame information, thereby improving the signal-to-noise ratio of the video.
3. The invention establishes multi-scale all-pairs correlation between events and frames in the feature space to compensate for the spatial and modal misalignment between events and frames caused by inter-modal differences and the difficulty of pixel-wise alignment under dim light conditions.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the embodiments are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present invention, and other drawings can be obtained from these drawings by a person of ordinary skill in the art.
Fig. 1 is a flow chart of a method of prior art 1.
Fig. 2 is a flow chart of a method of prior art 2.
FIG. 3 is a flow chart of a method provided by the present invention.
Fig. 4 is a diagram of a hybrid camera system according to an embodiment of the present invention.
Detailed Description
In order to better understand the technical solution, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by a person of ordinary skill in the art fall within the protection scope of the present invention.
The invention provides a dim light video enhancement method based on event camera assistance, as shown in fig. 3, comprising the following steps:
S1, a hybrid camera system consisting of a frame camera and an event camera is used to acquire a sequence of consecutive dim light video frames and the corresponding event signal E_{t-N→t+N} triggered during this period, where N is the temporal radius; the spatial resolution h'×w' of the event camera is lower than the spatial resolution h×w of the frame camera;
S2, two modality feature encoders extract frame features and event features from the dim light video frame L_t and the event signal E_{t→t+1}, respectively;
S3, the frame features and event features are aligned by the multi-modal correlation module to obtain the aligned frame and event features, and the optical flow S_{t→t+1} is jointly estimated;
S4, the aligned frame and event features are fused into noise-suppressed features by the temporal correlation propagation module under the guidance of the optical flow S_{t→t+1}, and the exposure estimation module then produces an exposure parameter map P_t;
S5, the exposure parameter map P_t is applied to the noise-suppressed features, which are passed through the decoder to obtain a noise-reduced frame, and the final restored frame I_t is then obtained through inversion,
where the exposure parameter map is applied by element-wise multiplication, f_CRF is the latent camera response function, and the inversion uses the inverse function of f_CRF.
Each module has a correspondingly designed neural network to implement different functions, as described in detail below.
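To make the overall data flow concrete, the following sketch outlines one enhancement step in PyTorch-style pseudocode. It is a non-authoritative illustration: the function name enhance_frame, the encoder/module interfaces, the tensor shapes, and the way the inversion of the camera response is realized (the inverse of a gamma-2.2 curve) are assumptions for illustration and are not specified verbatim in the text.

```python
def enhance_frame(L_t, L_t1, E_t, encoders, modules):
    """One enhancement step for the current dim light frame L_t (a sketch).

    L_t, L_t1 : dim light frames at t and t+1, shape (B, 3, H, W)
    E_t       : event voxel grid for the interval [t, t+1], shape (B, bins, H', W')
    encoders  : dict with callables 'frame' and 'event'
    modules   : dict with callables 'corr', 'propagation', 'exposure', 'decoder'
    """
    # S2: extract modality-specific features
    f_frame_t  = encoders['frame'](L_t)
    f_frame_t1 = encoders['frame'](L_t1)
    f_event    = encoders['event'](E_t)

    # S3: multi-modal correlation -> aligned event features + jointly estimated flow
    f_event_aligned, flow = modules['corr'](f_frame_t, f_frame_t1, f_event)

    # S4: temporal correlation propagation -> noise-suppressed features
    f_clean = modules['propagation'](f_frame_t, f_frame_t1, f_event_aligned, flow)

    # S5: exposure parameter map applied to the denoised features, then decoding
    P_t = modules['exposure'](f_frame_t, f_event_aligned)
    denoised = modules['decoder'](P_t * f_clean)       # element-wise multiplication
    # inversion through the (assumed) gamma-2.2 camera response: f_CRF(x) = x ** (1/2.2)
    I_t = denoised.clamp(min=0.0) ** 2.2
    return I_t
```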
1. Multi-modal correlation module
The four-dimensional correlation volume of the multi-modal correlation module is established as follows:
first, unimodal optical flow estimators for frames and for events extract, from the dim light video frame L_t and the event signal E_{t→t+1} respectively, frame features and event features with different spatial resolutions;
then the four-dimensional correlation volume representing the all-pairs correlation between the dim light video frame L_t and the event signal E_{t→t+1} is obtained, where p and q denote the pixel indices of features extracted from the frames and from the events, respectively, and the correlation score is calculated as an exponential function of the inner product between the transposed frame feature and the event feature. Using an exponential function of the inner product boosts the magnitude of strong correlations and suppresses weak correlations caused by misaligned pixels or modality differences.
To further expand the receptive field, the multi-modal correlation module performs three poolings with an average pooling factor of 2 along the event spatial dimensions of the correlation volume, thereby obtaining correlation volumes at coarser scales s while high-resolution details are maintained, so that correspondences are modeled at different scales.
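A minimal sketch of this all-pairs correlation and its average-pooling pyramid is given below. The function name build_correlation_pyramid, the tensor shapes, and the 1/√C scaling inside the exponential are illustrative assumptions; only the exponential of the inner product and the three ×2 average poolings along the event dimensions come from the text.

```python
import torch
import torch.nn.functional as F

def build_correlation_pyramid(frame_feat, event_feat, num_levels=4):
    """All-pairs correlation between frame and event features (a sketch).

    frame_feat : (B, C, H, W)   features extracted from the dim light frame
    event_feat : (B, C, He, We) features extracted from the event signal
    Returns num_levels correlation volumes; level 0 is the full-resolution
    volume and each further level average-pools the event dimensions by 2,
    matching the three poolings described above when num_levels = 4.
    """
    B, C, H, W = frame_feat.shape
    _, _, He, We = event_feat.shape

    f = frame_feat.reshape(B, C, H * W)
    e = event_feat.reshape(B, C, He * We)
    # exponential of the (scaled) inner product: boosts strong correlations and
    # suppresses weak ones caused by misaligned pixels or modality differences;
    # the 1/sqrt(C) scaling is an added numerical-stability assumption
    corr = torch.exp(torch.einsum('bcm,bcn->bmn', f, e) / C ** 0.5)

    pyramid = []
    vol = corr.reshape(B * H * W, 1, He, We)   # treat event dims as spatial dims
    for _ in range(num_levels):
        pyramid.append(vol.reshape(B, H, W, *vol.shape[-2:]))
        vol = F.avg_pool2d(vol, kernel_size=2, stride=2)
    return pyramid
```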
In step S3, the four-dimensional correlation volume is used to extract global dependencies. Specifically, the multi-modal correlation module aligns the frame features and event features to obtain a coarse alignment of the two, compensating for resolution differences and possible sensor misalignment. Global misalignment caused by sensor misalignment or camera motion can be effectively regularized by a projection matrix with few parameters. Considering robustness to errors and noise under dim light conditions, the present invention performs global alignment in a low-resolution feature space. Specifically, the process comprises:
the event features are aligned through a projection matrix M_t to obtain the aligned features, where the alignment applies a projective transformation defined by M_t. M_t is a 3×3 projection matrix parameterized by a 2×2×2 displacement cube D_t containing the displacement vectors of the four corner points of the image, which can be conveniently converted into a projection matrix. The displacement cube D_t is initialized with the identity transformation and then iteratively updated by the global alignment module. The global alignment module operates as follows: first, the displacement cube D_t obtained in the previous iteration step is parameterized as an updated 3×3 projection matrix M_t. Letting X denote the grid coordinates of the event features, X is mapped through the projection matrix M_t to the grid coordinate set X' of the current iteration step, and the corresponding displacement flow is F = X' - X. Meanwhile, X' is used as an index to sample a correlation volume slice from the four-dimensional correlation volume downsampled to size (32×32)×(32×32). The correlation volume slice and the displacement flow F are then input together to the displacement cube decoding module to obtain the displacement cube D_t of the current iteration step.
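The parameterization of the projection matrix by the four corner displacements can be illustrated with a standard direct linear transform, as sketched below. The function name, the corner ordering inside the 2×2×2 displacement cube, and the use of torch.linalg.solve are assumptions; only the fact that a 2×2×2 corner-displacement cube is converted into a 3×3 projection matrix comes from the text.

```python
import torch

def homography_from_corner_displacement(D, height, width):
    """Convert a 2x2x2 corner-displacement cube D into a 3x3 projection matrix.

    D[i, j] is assumed to hold the (dx, dy) displacement of the image corner at
    (x = j * (width - 1), y = i * (height - 1)). Solves the 8-equation DLT
    system with the bottom-right entry of the matrix fixed to 1.
    """
    src = torch.tensor([[0.0, 0.0], [width - 1.0, 0.0],
                        [0.0, height - 1.0], [width - 1.0, height - 1.0]])
    dst = src + D.reshape(4, 2)

    A = torch.zeros(8, 8)
    b = torch.zeros(8)
    for k in range(4):
        x, y = src[k].tolist()
        u, v = dst[k].tolist()
        A[2 * k]     = torch.tensor([x, y, 1.0, 0.0, 0.0, 0.0, -x * u, -y * u])
        A[2 * k + 1] = torch.tensor([0.0, 0.0, 0.0, x, y, 1.0, -x * v, -y * v])
        b[2 * k], b[2 * k + 1] = u, v
    h = torch.linalg.solve(A, b)
    return torch.cat([h, torch.ones(1)]).reshape(3, 3)

# identity initialization of the displacement cube, as described above
D_t = torch.zeros(2, 2, 2)
M_t = homography_from_corner_displacement(D_t, height=32, width=32)  # identity matrix
```

In the iterative loop of the global alignment module, D_t would be re-estimated by the displacement cube decoding module from the sampled correlation volume slice and the displacement flow F; the sketch above only covers the parameterization step.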
The present invention uses global feature alignment to model sensor misalignment and camera motion between events and frames, and additionally uses pixel-level motion integration to handle the residual correspondences caused by object motion or repetitive patterns within the exposure time. The present invention estimates the optical flow S_{t→t+1} jointly from events and frames to supplement the correlation modeling at the pixel level.
For motion estimation, event cameras have overwhelming advantages, especially for large displacements and occlusions. However, they typically have lower spatial resolution and are only triggered in regions with moving edges, lacking information in low-texture regions. Fortunately, despite their weaker features, the dim light video frames at the boundary timestamps still preserve a fine-grained, high-resolution appearance that can supplement the motion information in the events.
In the invention, the process by which the multi-modal correlation module jointly estimates the optical flow S_{t→t+1} from the frame features and event features comprises:
using the feature extractors of the unimodal optical flow estimators, frame features and event features within the time interval [t, t+1] are estimated from the two consecutive frames L_t, L_{t+1} and the events E_{t→t+1}; the optical flow extracted from the frame features and the optical flow extracted from the event features are fused, with the event flow projected into the coordinates of the frame coordinate system; the four-dimensional correlation volume is then used to aggregate the two flows to obtain the estimated optical flow S_{t→t+1} between the two frames.
For computational efficiency, the present invention only considers, for each pixel p in the dim light video frame L_t, a local correlation region of radius r centered at its corresponding position p' in the event sensor coordinates (rather than all-position associations as in full self-attention) in order to jointly estimate the optical flow S_{t→t+1} from the events E_{t→t+1} and the frames L_t, L_{t+1}, where r is the radius of the local region and δp is the displacement variable of pixel p ranging over an integer set. The local region is used as an index to sample correlation volume slices from the four-dimensional correlation volume, and these correlation scores are used as weights to aggregate the supplementary information from the events into the frames as a residual estimate, where the event flow is projected into the frame space and Z is a normalization factor; the result is refined by the unimodal frame optical flow estimator to obtain the jointly estimated optical flow S_{t→t+1} used in the subsequent temporal correlation propagation.
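The correlation-weighted aggregation over the local window can be sketched as follows. This is a reading of the text rather than the exact formula: the function name, the assumption that the correlation scores have already been sliced to a (2r+1)² window per pixel, and the normalization by the sum of the correlation values (standing in for Z) are illustrative choices.

```python
import torch
import torch.nn.functional as F

def aggregate_event_flow(flow_event, corr_local, radius=3):
    """Correlation-weighted aggregation of event flow over a local window.

    flow_event : (B, 2, H, W) event flow already projected into frame coordinates
    corr_local : (B, (2r+1)**2, H, W) correlation scores sampled from the 4D
                 correlation volume over the radius-r window around each pixel
    Returns the aggregated event flow, which the method uses as a residual term
    that is subsequently refined by the unimodal frame optical flow estimator.
    """
    k = 2 * radius + 1
    B, _, H, W = flow_event.shape

    # candidate event flows in the local window around each pixel
    cand = F.unfold(flow_event, kernel_size=k, padding=radius)   # (B, 2*k*k, H*W)
    cand = cand.reshape(B, 2, k * k, H, W)

    # correlation values (already exponentials of inner products) normalized by Z
    w = (corr_local / corr_local.sum(dim=1, keepdim=True).clamp(min=1e-8)).unsqueeze(1)
    return (w * cand).sum(dim=2)                                 # (B, 2, H, W)
```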
2. Temporal correlation propagation module
In step S4, the process of fusing the noise-suppressed features by the temporal correlation propagation module comprises:
S41, a pixel p in the current frame L_t appears in frame L_{t+1} at the position p + δ_{t+1}(p), where δ_{t+1} denotes the inter-frame displacement. However, asynchronous events have microsecond-level temporal resolution, and their motion over the duration from t to t+1 is generally not constant. To fill the time gap between the events and the frames for more accurate motion estimation, the present invention decomposes the optical flow of the events into two parts: one part has a constant velocity and is subsampled from the jointly estimated optical flow S_{t→t+1}; the other part has a time-varying velocity and is learned as a flow offset field.
S42, according to the jointly estimated optical flow S_{t→t+1}, the event features and the frame features are aligned to the intermediate frame according to time. The features of the next frame L_{t+1} are pre-registered to the features of the current frame L_t using the optical flow S_{t→t+1} jointly estimated within the time interval [t, t+1]. The motion between the B timestamps between the two frames L_t, L_{t+1} is estimated, and the event features are registered to the current frame features from the spatial slices of the constant-velocity part within the spatio-temporal neighborhood, where the τ-th slice contains the event features of the τ-th temporal bin and the registration is a correlation-aware variant of the warping operation, in which Z is used for normalization and the temporal consistency C_temp is estimated from consecutive frames.
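A sketch of the constant-velocity pre-registration is given below: each of the B temporal event-feature slices is backward-warped to the current frame with a linearly scaled version of the jointly estimated flow. The linear τ/B scaling, the function names, and the bilinear grid_sample warping are assumptions used for illustration; the correlation-aware weighting with C_temp is omitted here.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(feat, flow):
    """Backward-warp a feature map with a dense flow field (bilinear sampling)."""
    B, _, H, W = feat.shape
    ys, xs = torch.meshgrid(torch.arange(H, device=feat.device),
                            torch.arange(W, device=feat.device), indexing='ij')
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow   # (B, 2, H, W)
    gx = 2.0 * grid[:, 0] / (W - 1) - 1.0                             # normalize to [-1, 1]
    gy = 2.0 * grid[:, 1] / (H - 1) - 1.0
    return F.grid_sample(feat, torch.stack((gx, gy), dim=-1), align_corners=True)

def register_event_slices(event_slices, flow_t_to_t1):
    """Pre-register the B temporal event-feature slices to the current frame:
    for the constant-velocity part of the motion, slice tau is warped with
    tau/B of the jointly estimated inter-frame flow."""
    num_bins = len(event_slices)
    return [warp_with_flow(feat, flow_t_to_t1 * (tau / num_bins))
            for tau, feat in enumerate(event_slices, start=1)]
```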
S43, to compensate for the temporal misalignment between events and frames, the frame features registered to the current frame, the registered event features, and S_{t→t+1} are concatenated along the channel dimension to predict the flow offsets Δ_{t→t+1} with a flow offset estimator module. The flow offsets Δ_{t→t+1} are used as residuals of the flow and together form the offset field of a deformable convolution layer, which implicitly integrates the events and frames into the latent frame and propagates temporal consistency at timestamp t for noise reduction, yielding the noise-suppressed features through the flow-guided inter-frame propagation module.
The noise-suppressed features are then decoded into a noise-reduced frame.
3. Exposure estimation module
The exposure estimation module of the invention approximates the camera response function using a gamma curve with parameter 1/2.2 and uses an exposure parameter network to predict the exposure parameter map P_t.
Specifically, in step S4, the process of obtaining the exposure parameter map P_t by the exposure estimation module comprises:
S44, given the features of the events and frames, the exposure parameter network estimates a grid Γ_t in a low-dimensional bilateral space;
S45, guided by the input dim light frame L_t, an exposure parameter map P_t is extracted for each pixel from the grid Γ_t estimated in the low-dimensional bilateral space via a slicing operation; the exposure parameter map P_t has the same resolution as the input frame.
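A minimal sketch of the slicing operation from a low-dimensional bilateral grid is shown below. The function name slice_bilateral_grid, the grid layout (B, C, D, Hg, Wg), and the use of the frame luminance as the guidance map are assumptions; the text only states that per-pixel exposure parameters are sliced from a bilateral-space grid at the input resolution.

```python
import torch
import torch.nn.functional as F

def slice_bilateral_grid(grid, guide):
    """Slice a per-pixel exposure parameter map from a low-resolution bilateral grid.

    grid  : (B, C, D, Hg, Wg) bilateral grid of exposure parameters
            (D is the guidance/intensity dimension)
    guide : (B, 1, H, W) guidance map in [0, 1], e.g. the luminance of the
            input dim light frame
    Returns a (B, C, H, W) map at the full input resolution.
    """
    B, _, H, W = guide.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, H, device=guide.device),
        torch.linspace(-1, 1, W, device=guide.device), indexing='ij')
    xs = xs.expand(B, H, W)
    ys = ys.expand(B, H, W)
    zs = guide.squeeze(1) * 2 - 1                              # map guidance to [-1, 1]
    coords = torch.stack((xs, ys, zs), dim=-1).unsqueeze(1)    # (B, 1, H, W, 3)
    sliced = F.grid_sample(grid, coords, align_corners=True)   # (B, C, 1, H, W)
    return sliced.squeeze(2)
```

In this method the guidance would come from the input dim light frame L_t, and the grid Γ_t from the exposure parameter network operating on the frame and event features.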
The invention adopts synthetic data to train the neural network, and the specific training process is as follows:
1. acquiring a composite dataset
The training dataset is based on the video segmentation dataset DAVIS and comprises 107 pairs of synthesized normal light and dim light videos (6208 frames), randomly split into 87 videos for the training set and 20 videos for the test set. The present invention uses gamma correction and linear scaling to synthesize a noise-free dim light frame L_t from a normal light frame I_t:
L_t(p) = β × (α × I_t(p))^γ
where α, β, γ are sampled from uniform distributions.
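A sketch of this synthesis step is given below; the sampling ranges for α, β, γ are placeholders, since the patent samples them from uniform distributions whose bounds are not reproduced in the text.

```python
import torch

def synthesize_dim_light(frame, alpha_range=(0.9, 1.0), beta_range=(0.5, 1.0),
                         gamma_range=(1.5, 5.0)):
    """Synthesize a noise-free dim light frame from a normal light frame via
    gamma correction and linear scaling: L(p) = beta * (alpha * I(p)) ** gamma.
    The sampling ranges are illustrative placeholders."""
    alpha = torch.empty(1).uniform_(*alpha_range)
    beta = torch.empty(1).uniform_(*beta_range)
    gamma = torch.empty(1).uniform_(*gamma_range)
    return beta * (alpha * frame).clamp(min=0) ** gamma
```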
To meet the requirement of mixed event and dim light video input, events are further synthesized using the video-to-event simulator v2e. The spatial resolution of all frames is 854×480 and the spatial resolution of the events is 427×240, simulating the spatial resolution difference between the two modalities. In practice, events and frames in a hybrid camera system are difficult to align perfectly, so the present invention applies a random perspective transformation between the two modalities to simulate this.
Noise under dim light conditions is a key factor of concern for the present invention. For events, the invention simulates degradations that may be caused by insufficient illumination with the v2e simulator, such as limited bandwidth, more leakage events, and photon noise. For frames, in order to enable the proposed method to generalize to complex real-world situations, the invention uses a more realistic dim light frame degradation process. At low intensities the Poisson distribution behaves very differently from the signal-dependent Gaussian distribution, so the invention samples photon noise from a Poisson distribution, with the noise scale sampled from a uniform range, rather than using the usual Gaussian approximation. Second, JPEG compression is common in digital images; under dim light conditions, where features are weaker, the resulting artifacts become more pronounced, so JPEG compression is applied in the simulation process, with the quality factor sampled from a uniform range.
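The frame degradation described above can be sketched as follows. The photon-count scale and the JPEG quality value are placeholders (the patent samples both from uniform ranges that are not reproduced in the text), and the use of numpy/PIL here is an implementation choice.

```python
import io
import numpy as np
from PIL import Image

def degrade_dim_light_frame(frame, photon_scale=200.0, jpeg_quality=60):
    """Apply signal-dependent Poisson photon noise and JPEG compression.

    frame        : float32 array in [0, 1], shape (H, W, 3)
    photon_scale : expected photon count at full intensity (placeholder value)
    jpeg_quality : JPEG quality factor (placeholder value)
    """
    # Poisson (shot) noise: its variance scales with the signal, unlike the
    # usual signal-dependent Gaussian approximation
    noisy = np.random.poisson(np.clip(frame, 0, 1) * photon_scale) / photon_scale
    noisy = np.clip(noisy, 0.0, 1.0)

    # JPEG compression artifacts, which are more visible when features are weak
    buf = io.BytesIO()
    Image.fromarray((noisy * 255).astype(np.uint8)).save(buf, format='JPEG',
                                                         quality=jpeg_quality)
    buf.seek(0)
    return np.asarray(Image.open(buf)).astype(np.float32) / 255.0
```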
2. Implementation details of neural networks
Event information preprocessing: the raw event signal is represented as a four-tuple e = (t, p, σ) triggered at pixel p and time t when the logarithmic change of irradiance exceeds a predefined threshold θ (θ > 0),
where R_t denotes the instantaneous intensity at time t and the polarity σ ∈ {-1, +1} denotes a {negative, positive} brightness change. Such a representation is discrete and sparse and is inconvenient for convolutional neural networks to process. In the present invention, the events are converted into a voxel-grid tensor representation by discretizing the time span Δt = t_{K-1} - t_0 covered by K events into B temporal bins; the polarity σ_k of each event e_k = (t_k, p_k, σ_k) is assigned to the two nearest voxels according to its normalized timestamp.
The number of channels B is set to 15 in the present invention.
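A sketch of the event-to-voxel-grid conversion follows; the function signature and the default spatial resolution are illustrative, while the bilinear assignment to the two temporally nearest bins and B = 15 follow the description above.

```python
import torch

def events_to_voxel_grid(ts, xs, ys, ps, num_bins=15, height=260, width=346):
    """Convert an event stream into a (num_bins, H, W) voxel-grid tensor.

    ts, xs, ys, ps : 1-D tensors of timestamps, pixel coordinates and
                     polarities (+1 / -1) for K events.
    Each polarity is distributed to the two temporally nearest bins using the
    normalized timestamp (B = 15 channels by default).
    """
    voxel = torch.zeros(num_bins, height, width)
    if ts.numel() == 0:
        return voxel

    ts = ts.float()
    t_norm = (num_bins - 1) * (ts - ts[0]) / max((ts[-1] - ts[0]).item(), 1e-9)
    left = t_norm.floor().long().clamp(0, num_bins - 1)
    right = (left + 1).clamp(0, num_bins - 1)
    w_right = t_norm - left.float()
    w_left = 1.0 - w_right

    flat = voxel.view(num_bins, -1)
    idx = ys.long() * width + xs.long()
    flat.index_put_((left, idx), ps.float() * w_left, accumulate=True)
    flat.index_put_((right, idx), ps.float() * w_right, accumulate=True)
    return voxel
```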
The event and frame optical flow estimators are initialized from the pre-trained models E-RAFT and RAFT, respectively.
3. Neural network training
The whole network is trained end-to-end using a combination of an l1 loss and a gradient loss,
where λ_grad = 10 and λ_exposure = 100 are hyperparameters that balance the contributions of the different terms.
The first two terms constrain the fidelity between the predicted normal light frame I_t and the ground truth in the intensity domain and the gradient domain, respectively.
The third term regularizes the exposure parameters so that a blurred version of the dark frame is correctly enhanced to its normal light counterpart,
where a Gaussian blur operation with variance 2 is used to improve the robustness of the exposure parameters to noise.
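The combined objective can be sketched as follows. The finite-difference gradient operator, the exact form of the exposure term, and the kernel size of the Gaussian blur are assumptions consistent with, but not identical to, the formulas referenced above.

```python
import torch.nn.functional as F
from torchvision.transforms.functional import gaussian_blur

def image_gradients(x):
    """Finite-difference gradients along width and height."""
    dx = x[..., :, 1:] - x[..., :, :-1]
    dy = x[..., 1:, :] - x[..., :-1, :]
    return dx, dy

def training_loss(pred, gt, exposure_map, dark_frame,
                  lambda_grad=10.0, lambda_exposure=100.0):
    """l1 fidelity + gradient-domain fidelity + exposure regularization (a sketch).
    The exposure term here assumes that the exposure map applied to a blurred
    dark frame should match the normal light target."""
    l1 = F.l1_loss(pred, gt)

    pdx, pdy = image_gradients(pred)
    gdx, gdy = image_gradients(gt)
    l_grad = F.l1_loss(pdx, gdx) + F.l1_loss(pdy, gdy)

    # Gaussian blur with variance 2 (sigma = sqrt(2)) for robustness to noise
    blurred = gaussian_blur(dark_frame, kernel_size=7, sigma=2 ** 0.5)
    l_exp = F.l1_loss(exposure_map * blurred, gt)

    return l1 + lambda_grad * l_grad + lambda_exposure * l_exp
```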
Data augmentation: during training, 128×128 patches of the dim light video frames and the corresponding event signals are randomly cropped, and horizontal flipping and rotation (by 90, 180, and 270 degrees) are used for data augmentation.
The code is implemented with the PyTorch framework. The neural network is optimized with the Adam optimizer with an initial learning rate of 1×10^-4, the batch size is set to 4, and training runs for 50 epochs on a single NVIDIA TITAN RTX GPU.
Corresponding to the method, the invention also provides a dim light video enhancement device based on event camera assistance, comprising the following modules to implement any of the methods described above:
a hybrid camera system consisting of a frame camera and an event camera, used to acquire a sequence of consecutive dim light video frames and the corresponding event signal triggered during this period, wherein the spatial resolution of the event camera is lower than that of the frame camera, the frame camera and the event camera are connected through a beam splitter, and the beam splitter splits the incident light into two outgoing beams that enter the two cameras synchronously;
two modality feature encoders, used to extract frame features and event features from the dim light video frame L_t and the event signal E_{t→t+1}, respectively;
a multi-modal correlation module, used to align the frame features and event features to obtain the aligned frame and event features and to jointly estimate the optical flow S_{t→t+1};
a temporal correlation propagation module, used to fuse the aligned frame and event features into noise-suppressed features under the guidance of the optical flow S_{t→t+1};
an exposure estimation module, used to perform exposure estimation on the frame and event features to obtain an exposure parameter map P_t;
a decoder, used to apply the exposure parameter map P_t to the noise-suppressed features to obtain a noise-reduced frame, after which the final restored frame I_t is obtained through inversion.
An embodiment of the present invention is shown in fig. 4, and includes the following steps:
1. A hybrid camera system is built. The hybrid camera system comprises an ordinary frame camera (FLIR Chameleon3 Color, spatial resolution 1920×1280, frame rate 20 fps) and an event camera (DAVIS346, spatial resolution 346×260, temporal precision about 1 microsecond) connected by a beam splitter (Thorlabs CCM1-BS013). The two cameras use the same type of lens; the beam splitter splits the incident light into two outgoing beams that enter the two camera sensors synchronously.
2. Event signal preprocessing: the discrete event signals are converted into voxel grid representations.
3. The multi-frame dim light video frames captured by the frame camera and the voxel grid representations of the corresponding event signals are input to the trained neural network to obtain the final enhanced normal light video result.
By introducing an event camera with high temporal resolution, the invention assists the extraction of motion information and correlation between dim light video frames, and enhances the contrast and signal-to-noise ratio of the dim light video frames using the ideas of multi-frame alignment and fusion; the enhancement results are substantially better than those of existing dim light video enhancement methods. Unlike prior art that requires precise spatial alignment between the event camera and the frame camera, the invention can, through the multi-modal correlation modeling module, cope with imprecise alignment between the event camera and the frame camera in real dim light high-speed scenes, and therefore has a wide range of application.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
In this specification, each embodiment is described in a related manner, and identical and similar parts of each embodiment are all referred to each other, and each embodiment mainly describes differences from other embodiments. In particular, for the apparatus embodiments, the electronic device embodiments, the computer-readable storage medium embodiments, and the computer program product embodiments, the description is relatively simple, as relevant to the description of the method embodiments in part, since they are substantially similar to the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A dim light video enhancement method based on event camera assistance, characterized by comprising the following steps:
S1, a hybrid camera system consisting of a frame camera and an event camera is used to acquire a sequence of consecutive dim light video frames and the corresponding event signal E_{t-N→t+N} triggered during this period, where N is the temporal radius; the spatial resolution h'×w' of the event camera is lower than the spatial resolution h×w of the frame camera;
S2, two modality feature encoders extract frame features and event features from the dim light video frame L_t and the event signal E_{t→t+1}, respectively;
S3, the frame features and event features are aligned by the multi-modal correlation module to obtain event features aligned with the frame feature space, and the optical flow S_{t→t+1} is jointly estimated;
S4, the current frame features, the next frame features, and the event features aligned with the current frame feature space are fused into noise-suppressed features by the temporal correlation propagation module under the guidance of the optical flow S_{t→t+1};
S5, the frame features and event features are input to the exposure estimation module to obtain an exposure parameter map P_t; the exposure parameter map P_t is applied to the noise-suppressed features, which are passed through the decoder to obtain a noise-reduced frame, and the final restored frame I_t is then obtained through inversion.
2. The event camera-assisted dim light video enhancement method according to claim 1, wherein the four-dimensional correlation volume extracted by the multi-modal correlation module in step S3 is established as follows:
first, unimodal optical flow estimators for frames and for events extract, from the dim light video frame L_t and the event signal E_{t→t+1} respectively, frame features and event features with different spatial resolutions;
then a four-dimensional correlation volume representing the all-pairs correlation between the dim light video frame L_t and the event signal E_{t→t+1} is obtained,
where p and q denote the pixel indices of features extracted from the frames and from the events, respectively, and the correlation score is calculated using an exponential function of the inner product of the transposed frame feature and the event feature.
3. The event camera-assisted dim light video enhancement method according to claim 2, wherein the multi-modal correlation module performs three poolings with an average pooling factor of 2 along the event spatial dimensions of the four-dimensional correlation volume, thereby obtaining correlation volumes at scale s.
4. The event camera-assisted dim light video enhancement method according to claim 1, wherein in step S3 the process by which the multi-modal correlation module uses the four-dimensional correlation volume to align the frame features and event features and obtain the event features aligned with the frame feature space comprises:
the event features are aligned through a projection matrix M_t to obtain the aligned features,
where the alignment applies a projective transformation defined by M_t; M_t is a 3×3 projection matrix obtained by parameterizing a 2×2×2 displacement cube D_t containing the displacement vectors of the four corner points of the image; the displacement cube D_t is initialized with the identity transformation and then iteratively updated by the global alignment module; the global alignment module operates as follows: first, the displacement cube D_t obtained in the previous iteration step is parameterized as an updated 3×3 projection matrix M_t; letting X denote the grid coordinates of the event features, X is mapped through the projection matrix M_t to the grid coordinate set X' of the current iteration step, and the corresponding displacement flow is F = X' - X; meanwhile, X' is used as an index to sample a correlation volume slice from the four-dimensional correlation volume downsampled to size (32×32)×(32×32); the correlation volume slice and the displacement flow F are then input together to the displacement cube decoding module to obtain the displacement cube D_t of the current iteration step.
5. The event camera-assisted dim light video enhancement method according to claim 1, wherein in step S3 the process by which the multi-modal correlation module uses the four-dimensional correlation volume to jointly estimate the optical flow S_{t→t+1} from the frame features and event features comprises:
using the feature extractors of the unimodal optical flow estimators, frame features and event features within the time interval [t, t+1] are estimated from the two consecutive frames L_t, L_{t+1} and the events E_{t→t+1}; the optical flow extracted from the frame features and the optical flow extracted from the event features are fused, with the event flow projected into the coordinates of the frame coordinate system; the four-dimensional correlation volume is then used to aggregate the two flows to obtain the estimated optical flow S_{t→t+1} between the two frames.
6. The event camera-assisted dim light video enhancement method according to claim 5, wherein a local correlation region of radius r centered at p', the position in sensor coordinates corresponding to each pixel p in the dim light video frame L_t, is used to jointly estimate the optical flow S_{t→t+1} from the events E_{t→t+1} and the frames L_t, L_{t+1},
where r is the radius of the local region and δp is the displacement variable of pixel p ranging over an integer set; the local region is used as an index to sample correlation volume slices from the four-dimensional correlation volume, and these correlation scores are used as weights to aggregate the supplementary information from the events into the frames as a residual estimate,
where the event flow is projected into the frame space, Z is a normalization factor, and the result is refined by the unimodal frame optical flow estimator to obtain the jointly estimated optical flow S_{t→t+1} used in the subsequent temporal correlation propagation.
7. The event camera-assisted dim light video enhancement method according to claim 1, wherein in step S4 the process of fusing the noise-suppressed features by the temporal correlation propagation module comprises:
S41, a pixel p in the current dim light video frame L_t appears in frame L_{t+1} at a position displaced by the inter-frame motion, and the optical flow of the events is decomposed into two parts: one part has a constant velocity and is subsampled from the jointly estimated optical flow S_{t→t+1}; the other part has a time-varying velocity and is learned as a flow offset field;
S42, according to the jointly estimated optical flow S_{t→t+1}, the event features and the frame features are aligned to the intermediate frame according to time; the features of the next frame L_{t+1} are pre-registered to the features of the current frame L_t using the optical flow S_{t→t+1} jointly estimated within the time interval [t, t+1]; the motion between the B timestamps between the two frames L_t, L_{t+1} is estimated, and the event features are registered to the current frame features from the spatial slices of the constant-velocity part within the spatio-temporal neighborhood, where the τ-th slice contains the event features of the τ-th temporal bin and the registration is a correlation-aware variant of the warping operation, in which Z is used for normalization and the temporal consistency C_temp is estimated from consecutive frames;
S43, the frame features registered to the current frame, the registered event features, and S_{t→t+1} are concatenated along the channel dimension to predict the flow offsets Δ_{t→t+1} with a flow offset estimator module; the flow offsets Δ_{t→t+1} are used as residuals of the flow and together form the offset field of a deformable convolution layer, which implicitly integrates the events and frames into the latent frame and propagates temporal consistency at timestamp t for noise reduction, yielding the noise-suppressed features through the flow-guided inter-frame propagation module;
the noise-suppressed features are then decoded into a noise-reduced frame.
8. The event camera-assisted dim light video enhancement method according to claim 1, wherein in step S4 the exposure estimation module approximates the camera response function using a gamma curve with parameter 1/2.2 and uses an exposure parameter network to predict the exposure parameter map P_t.
9. The event camera-assisted dim light video enhancement method according to claim 8, wherein in step S4 the process of obtaining the exposure parameter map P_t by the exposure estimation module comprises:
S44, given the features of the events and frames, the exposure parameter network estimates a grid Γ_t in a low-dimensional bilateral space;
S45, guided by the input dim light frame L_t, an exposure parameter map P_t is extracted for each pixel from the grid Γ_t estimated in the low-dimensional bilateral space via a slicing operation, the exposure parameter map P_t having the same resolution as the input frame.
10. A camera-assisted dim-light video enhancement device based on an event, comprising the following modules to implement the method of any one of claims 1-9:
the system comprises a hybrid camera system of a frame camera and an event camera, wherein the hybrid camera system is used for acquiring a continuous multi-frame dim light video frame sequence and a corresponding event signal triggered in the period of time, the spatial resolution of the event camera is lower than that of the frame camera, the frame camera and the event camera are respectively connected through a spectroscope, the spectroscope divides incident light into two outgoing light beams, and the two outgoing light beams synchronously enter the two cameras;
two modality feature encoderAnd->Respectively for video frames L from darkness t And event signal E t→t+1 Extracting frame characteristics->And event feature->
A multi-modal correlation module for characterizing framesAnd event feature->Alignment, resulting in event feature aligned with frame feature space +.>Jointly estimated optical flow S t→t+1
a temporal correlation propagation module, used for fusing the current-frame feature, the next-frame feature and the event feature aligned with the current-frame feature space into the noise-suppressed feature F̂_t under the guidance of the optical flow S_{t→t+1};
an exposure estimation module, used for performing exposure estimation on the frame feature and the event feature to obtain the exposure parameter map P_t;
a decoder, used for applying the exposure parameter map P_t to the noise-suppressed feature F̂_t to obtain a noise-reduced frame, and then obtaining the final restored frame I_t through inversion.
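A hedged skeleton of how the listed modules might be wired together in a single forward pass; every class, argument and attribute name below is a placeholder introduced for illustration, not the device's actual module definition.

```python
import torch.nn as nn


class DimLightEnhancer(nn.Module):
    """Wiring of the listed modules; every submodule here is an injected placeholder."""

    def __init__(self, frame_encoder, event_encoder, correlation,
                 propagation, exposure_estimator, decoder):
        super().__init__()
        self.frame_encoder = frame_encoder            # L_t         -> F^L_t
        self.event_encoder = event_encoder            # E_{t->t+1}  -> F^E_t
        self.correlation = correlation                # aligns F^E to F^L, estimates the flow
        self.propagation = propagation                # flow-guided temporal fusion
        self.exposure_estimator = exposure_estimator  # bilateral-grid exposure map P_t
        self.decoder = decoder                        # applies P_t, decodes, inverts the CRF

    def forward(self, frame_t, frame_t1, events_t):
        f_cur = self.frame_encoder(frame_t)
        f_next = self.frame_encoder(frame_t1)
        f_evt = self.event_encoder(events_t)
        f_evt_aligned, flow = self.correlation(f_cur, f_evt)
        f_denoised = self.propagation(f_cur, f_next, f_evt_aligned, flow)
        p_t = self.exposure_estimator(f_cur, f_evt_aligned, frame_t)
        return self.decoder(f_denoised, p_t)
```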
CN202311320452.3A 2023-10-12 2023-10-12 Dim light video enhancement method and device based on event camera assistance Pending CN117319806A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311320452.3A CN117319806A (en) 2023-10-12 2023-10-12 Dim light video enhancement method and device based on event camera assistance

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311320452.3A CN117319806A (en) 2023-10-12 2023-10-12 Dim light video enhancement method and device based on event camera assistance

Publications (1)

Publication Number Publication Date
CN117319806A true CN117319806A (en) 2023-12-29

Family

ID=89288141

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311320452.3A Pending CN117319806A (en) 2023-10-12 2023-10-12 Dim light video enhancement method and device based on event camera assistance

Country Status (1)

Country Link
CN (1) CN117319806A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030080878A1 (en) * 2001-10-30 2003-05-01 Kirmuss Charles Bruno Event-based vehicle image capture
US20050057687A1 (en) * 2001-12-26 2005-03-17 Michael Irani System and method for increasing space or time resolution in video
US20200005469A1 (en) * 2017-02-14 2020-01-02 The Trustees Of The University Of Pennsylvania Event-based feature tracking
CN115082341A (en) * 2022-06-24 2022-09-20 西安理工大学 Low-light image enhancement method based on event camera
US20220329771A1 (en) * 2021-04-02 2022-10-13 Prophesee Method of pixel-by-pixel registration of an event camera to a frame camera
CN115761472A (en) * 2023-01-09 2023-03-07 吉林大学 Underwater dim light scene reconstruction method based on fusion event and RGB data
CN115908176A (en) * 2022-11-15 2023-04-04 哈尔滨工业大学 Uneven illumination image enhancement method based on event stream and color image fusion
CN116188930A (en) * 2023-02-15 2023-05-30 武汉大学 Scene recognition method and system based on fusion event camera
US11763485B1 (en) * 2022-04-20 2023-09-19 Anhui University of Engineering Deep learning based robot target recognition and motion detection method, storage medium and apparatus


Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JINXIU LIANG et al.: "Coherent Event Guided Low-Light Video Enhancement", 2023 IEEE/CVF International Conference on Computer Vision (ICCV), Paris, France, Oct. 2-6, 2023, 15 January 2024 (2024-01-15), pages 10581-10591 *
RUIXING WANG et al.: "Seeing Dynamic Scene in the Dark: A High-Quality Video Dataset with Mechatronic Alignment", 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 28 February 2022 (2022-02-28)
LI CHENGLONG et al.: "A Survey of Multi-modal Visual Tracking Methods", Journal of Image and Graphics, 16 January 2023 (2023-01-16)
GUO XUCHENG: "Research on Visual Odometry Fusing an Event Camera and a Standard Camera", China Master's Theses Full-text Database (Electronic Journal), Information Science and Technology, 15 January 2024 (2024-01-15)
HUANG TIEJUN et al.: "Progress in Spike Vision Research", Journal of Image and Graphics, 16 June 2022 (2022-06-16)

Similar Documents

Publication Publication Date Title
CN111325794B (en) Visual simultaneous localization and map construction method based on depth convolution self-encoder
US10593021B1 (en) Motion deblurring using neural network architectures
Wang et al. Deep learning for hdr imaging: State-of-the-art and future trends
Zhou et al. Cross-view enhancement network for underwater images
US20220044375A1 (en) Saliency Map Enhancement-Based Infrared and Visible Light Fusion Method
US20230260145A1 (en) Depth Determination for Images Captured with a Moving Camera and Representing Moving Features
Petrovai et al. Exploiting pseudo labels in a self-supervised learning framework for improved monocular depth estimation
CN111028177A (en) Edge-based deep learning image motion blur removing method
CN111242883A (en) Dynamic scene HDR reconstruction method based on deep learning
Li et al. A maximum a posteriori estimation framework for robust high dynamic range video synthesis
Shibata et al. Misalignment-robust joint filter for cross-modal image pairs
CN112991450B (en) Detail enhancement unsupervised depth estimation method based on wavelet
CN113284061B (en) Underwater image enhancement method based on gradient network
Yan et al. A unified HDR imaging method with pixel and patch level
Yu et al. Split-attention multiframe alignment network for image restoration
Singh et al. Weighted least squares based detail enhanced exposure fusion
Sun et al. Deep maximum a posterior estimator for video denoising
Liang et al. Coherent event guided low-light video enhancement
Zhang et al. Dehazing with improved heterogeneous atmosphere light estimation and a nonlinear color attenuation prior model
CN117237207A (en) Ghost-free high dynamic range light field imaging method for dynamic scene
CN117319806A (en) Dim light video enhancement method and device based on event camera assistance
Alshammri et al. Three-dimensional video super-resolution reconstruction scheme based on histogram matching and recursive Bayesian algorithms
Ma et al. Image Dehazing Based on Improved Color Channel Transfer and Multiexposure Fusion
CN115439388B (en) Free viewpoint image synthesis method based on multilayer nerve surface expression
Guo et al. Efficient high dynamic range video using multi-exposure CNN flow

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination