CN114612305A - Event-driven video super-resolution method based on stereogram modeling - Google Patents


Publication number
CN114612305A
Authority
CN
China
Prior art keywords
event
resolution
image
module
ith
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210245281.1A
Other languages
Chinese (zh)
Other versions
CN114612305B (en)
Inventor
查正军
傅雪阳
曹成志
时格格
朱禹睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210245281.1A priority Critical patent/CN114612305B/en
Publication of CN114612305A publication Critical patent/CN114612305A/en
Application granted granted Critical
Publication of CN114612305B publication Critical patent/CN114612305B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G06F 18/253 Fusion techniques of extracted features (pattern recognition)
    • G06N 3/045 Combinations of networks (neural network architectures)
    • G06N 3/084 Backpropagation, e.g. using gradient descent (neural network learning methods)
    • G06T 3/4046 Scaling the whole image or part thereof using neural networks

Abstract

The invention discloses an event-driven video super-resolution method based on stereogram modeling, which comprises the following steps: 1. acquiring video data and the corresponding event sequence, and segmenting the event sequence; 2. constructing a pixel attention module to extract features from the images; 3. re-sampling the neighborhood of each initial event through a sampling module and repeatedly adjusting the features of the sampled events; 4. performing stereo graph modeling on each sampled event and its neighborhood through an event graph module, so as to aggregate local features within the neighborhood and gradually enlarge the receptive field over the whole stereo event stream; 5. enabling the event features and the image features to interact through the feature interaction module. The method makes full use of the prior information provided by event data to drive video super-resolution, thereby effectively improving the super-resolution effect.

Description

Event-driven video super-resolution method based on stereogram modeling
Technical Field
The invention relates to the field of video super-resolution, in particular to an event-driven video super-resolution method based on stereogram modeling.
Background
Video, as an important data source in computer vision, inevitably suffers from blur caused by object motion, which degrades the subjective quality of experience and hinders further applications. Video deblurring has therefore attracted considerable attention. Because a significant amount of motion information is lost during the blurring process, recovering sharp video sequences from motion-blurred images alone is difficult. Recently, a new sensor called an event camera has been proposed that records scene intensity changes on the order of microseconds; fast motion can be captured as events at a high temporal rate, providing new opportunities for exploring solutions to video deblurring.
Video, as an important data source in computer visual communication, inevitably suffers from reduced image quality due to various external factors, which affects both subjective quality and the breadth of applications. In order to improve image definition, video super-resolution has attracted much attention. Recently, a new type of sensor called an event camera has been proposed for recording and capturing microsecond-level scene intensity variations. For event cameras, fast motion can be captured as high-temporal-rate events, which provides new opportunities for exploring video super-resolution solutions.
The event camera is a new type of sensor that asynchronously records brightness changes with microsecond precision, emitting positive or negative signals when the change exceeds predefined thresholds. Owing to its low latency, low cost and high temporal resolution, the event camera can serve many applications, such as video frame interpolation, super-resolution, deblurring, intensity image reconstruction and event stream denoising. However, most commercial event cameras produce relatively low-resolution streams in order to remain efficient. Because high-resolution images/videos benefit computer vision tasks such as gesture recognition, target tracking and classification, more and more researchers have turned to event-guided super-resolution of intensity frames and have achieved notable academic results.
Generally, event-based super-resolution methods can be classified into three categories: (1) cascading a network that converts events into intensity images with a super-resolution algorithm; (2) constructing an HR intensity image by directly super-resolving the event stream without intensity assistance; (3) taking mixed signals (e.g., APS frames and events) as input to increase the spatial resolution of the intensity image. However, these pipeline methods typically compress the event stream into event frames that have the same channels and scale as the video frames. This strategy leaves the spatial correlation in the stereo event stream under-utilized and produces unrealistic results. In addition, these methods process frames and events with the same feature-extraction procedure, ignoring the important distinction between sparse event streams and dense video frames. As a result, the sparse features in the events cannot be exploited correctly, since the shared weights and local receptive fields of convolutional layers distort most regions. These problems limit further development of event-based video super-resolution research.
Disclosure of Invention
In order to overcome the defects of existing methods, the invention provides an event-driven video super-resolution method based on stereogram modeling, so that better super-resolution performance can be achieved in video super-resolution tasks across different scenes.
In order to solve the technical problems, the invention adopts the following technical scheme:
the event-driven video super-resolution method based on stereogram modeling is characterized by comprising the following steps of:
step 1, acquiring training video data and a corresponding event sequence, and segmenting the event sequence:
step 1.1, acquiring a real low-resolution video image set, storing it frame by frame and recording it as X = {x_1, x_2, ..., x_i, ..., x_N}, where x_i denotes the i-th frame low-resolution image, i = 1, 2, ..., N, and N is the number of low-resolution frames; acquiring the event sequence corresponding to the low-resolution video image set X and dividing it, according to the frame number N, into a corresponding number of event sequences recorded as E = {e_1, e_2, ..., e_i, ..., e_N}, where e_i denotes the event sequence corresponding to the i-th frame low-resolution image;
acquiring a high-resolution video image set, storing it frame by frame and recording it as Y = {y_1, y_2, ..., y_j, ..., y_N}, where y_j denotes the j-th frame sharp (high-resolution) image;
letting I = {X, E, Y} denote the training image dataset;
step 2, constructing a video super-resolution neural network, which comprises: a pixel attention module, a sampling module, an event graph module, a feature interaction module and a decoding module;
step 2.1, a pixel attention module is constructed and used for carrying out feature extraction on image data:
the pixel attention module consists of 5 convolution layers and 1 sigmoid function, where the convolution kernel size in each convolution layer is ks and the stride is s;
the i-th frame low-resolution image x_i is input into the pixel attention module and processed by the 1st convolution layer to obtain the key key_i and the value value_i of the i-th frame low-resolution image x_i; key_i is processed by the 2nd convolution layer to obtain a dictionary key; meanwhile, key_i is processed by the 3rd convolution layer and the 1st sigmoid layer to obtain a query key, and the dictionary key and the query key are multiplied to obtain the association matrix A_{i,key};
value_i is passed through 1 convolution layer to obtain the processed value value'_i, which is multiplied by the association matrix A_{i,key} to obtain the image feature f_i^img of the i-th low-resolution image x_i, where C denotes the number of channels of f_i^img;
step 2.2, inputting the event sequence e_i corresponding to the i-th frame low-resolution image into the sampling module for sampling:
step 2.2.1, after 3D voxelization of the i-th event sequence e_i, dividing the voxels evenly into H blocks and averaging each block to obtain H sampled event key points, the h-th of which is denoted p_{i,h};
step 2.2.2, setting the feature of the h-th event key point p_{i,h} as the event feature feat_{i,h}, and taking the region surrounding the h-th event key point p_{i,h} as its neighborhood N(p_{i,h});
step 2.2.3, updating the event feature feat_{i,h} corresponding to the h-th event key point p_{i,h} by formula (1), so as to obtain the updated event feature feat'_{i,h} of the h-th event key point:
feat'_{i,h} = Up(Conv(feat_{i,h}))   (1)
in formula (1), Conv denotes a local convolution operation and Up denotes an upsampling operation;
step 2.3, the event graph module performs a stereo graph operation on the event features {feat'_{i,h}}_H of all event key points by formula (2) (available only as an image in the original publication), so as to obtain the global event feature f_i^evt of the i-th event sequence e_i; in formula (2), {feat'_{i,h}}_H denotes the event features corresponding to all event key points of the i-th event sequence e_i; Up denotes an upsampling operation and Conv denotes a convolution operation; feat'_{i,q} denotes the event feature of the q-th event key point; and a product function, whose definition (formula (3)) is likewise available only as an image, combines these key-point features;
step 2.4, constructing the feature interaction module and fusing the event features with the image features:
the feature interaction module comprises x downsampling layers and y weight-sharing convolution layers;
the image feature f_i^img and the global event feature f_i^evt are input into the feature interaction module; after the x downsampling layers and the y weight-sharing convolution layers, the g-th group of common features f_{i,g}^fuse is obtained by formula (4) (available only as an image in the original publication), thereby obtaining the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i; in formula (4), <·,·> denotes the inner product, f_{i,g}^evt denotes the g-th group of the global event feature f_i^evt, and G denotes the number of groups;
step 2.5, using the feature interaction module to output the final high-resolution image:
step 2.5.1, defining the iteration number as p and initializing p to 1; defining the maximum iteration number as P; taking the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i as the input data of the p-th iteration;
step 2.5.2, inputting the input data of the p-th iteration into the pixel attention module and the event graph module respectively for processing, so as to obtain the global event feature and the image feature of the p-th iteration;
step 2.5.3, processing the global event feature and the image feature of the p-th iteration with the feature interaction module, so as to obtain the input data of the (p+1)-th iteration;
step 2.5.4, after assigning p+1 to p, judging whether p > P holds; if so, inputting the input data of the (p+1)-th iteration into the decoding module for processing, so as to obtain the j-th frame high-resolution predicted image ŷ_j; otherwise, returning to step 2.5.2;
step 3, constructing the back-propagation loss function L_MSE by formula (5):
L_MSE = (1/R) · Σ_{r=1}^{R} ( ŷ_j^r - y_j^r )²   (5)
in formula (5), R is the number of pixels of the j-th frame high-resolution predicted image ŷ_j; ŷ_j^r is the r-th pixel of the high-resolution image generated by the neural network from the i-th low-resolution image x_i; y_j^r is the corresponding r-th pixel of the high-resolution image y_j in the high-resolution video image set Y;
step 4, training the video super-resolution neural network on the low-resolution image set X and its event sequences E, computing the loss function L_MSE and updating the network weights with learning rate lr_s; training is stopped when the number of training iterations reaches a set number or the loss error is smaller than a set threshold, so as to obtain the optimal super-resolution model; the optimal super-resolution network is then used to process a low-resolution video image and its corresponding event sequence, so as to obtain the corresponding high-resolution video image.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention uses event data to drive the video super-resolution task and achieves a good end-to-end super-resolution effect with a small number of parameters; compared with prior methods it reduces the parameter count and shows better robustness across different datasets. Experimental results show that the proposed method outperforms state-of-the-art methods on both synthetic and real datasets.
2. The invention perceives the temporal correlation between adjacent event sequences through a stereo event feature mechanism, exploiting the high-temporal-resolution information provided by events to adjust the coordinates of sampled events, capture the correlation of neighboring regions, and perceive long-term correlations in the whole event stream.
3. A temporal memory module is used to compute the long-term correlation between different events so as to recover the correlation of temporal events, and the final network is built from these two blocks and trained end to end. The similarity between the query and the key measures the temporal non-local correspondence to the current event and generates the corresponding value to perceive temporal change; a correlation matrix between the event at time T and the adjacent event sequences is obtained through a product operation and used to fuse the event features, so that the temporal information between different event sequences is continuously recovered.
4. The invention uses a feature interaction module to fuse image features and event features. The additional information in the events provides more detail for frame super-resolution, and the frame features can iteratively and adaptively fine-tune the event features. By modeling the global relations across space and channels, the global information of the input features is mined in depth, which improves the super-resolution performance and increases the interpretability of the model.
Drawings
FIG. 1 is a diagram of a super-resolution method of event-driven video based on stereogram modeling according to the present invention;
FIG. 2 is a block diagram of a sampling module according to the present invention;
FIG. 3 is a block diagram of an event graph module according to the present invention;
FIG. 4 is a block diagram of a feature interaction module of the present invention;
FIG. 5 is a flow chart of the inventive method.
Detailed Description
In this embodiment, the specific flow of the event-driven video super-resolution method based on stereogram modeling is shown in fig. 1. The method comprehensively considers the characteristics of event data and video sequences and makes full use of the prior information provided by the event data to drive video super-resolution, so that the super-resolution effect can be effectively improved; the algorithm structure of the whole method is shown in fig. 2. Specifically, the method comprises the following steps:
step 1, acquiring training video data and a corresponding event sequence, and segmenting the event sequence:
step 1.1, acquiring a real low-resolution video image set, storing it frame by frame and recording it as X = {x_1, x_2, ..., x_i, ..., x_N}, where x_i denotes the i-th frame low-resolution image, i = 1, 2, ..., N, and N is the number of low-resolution frames; acquiring the event sequence corresponding to the low-resolution video image set X and dividing it, according to the frame number N, into a corresponding number of event sequences recorded as E = {e_1, e_2, ..., e_i, ..., e_N}, where e_i denotes the event sequence corresponding to the i-th frame low-resolution image;
acquiring a high-resolution video image set, storing it frame by frame and recording it as Y = {y_1, y_2, ..., y_j, ..., y_N}, where y_j denotes the j-th frame sharp (high-resolution) image;
letting I = {X, E, Y} denote the training image dataset;
in this embodiment, the NFS dataset is used to train and evaluate the model; it contains 100 video sequences of different scenes, 80 of which are selected for training the model and the rest for evaluation;
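For illustration of step 1, a minimal sketch of how an event stream might be split into per-frame event sequences e_1, ..., e_N by frame timestamps is given below; the (t, x, y, p) array layout and the function name split_events_by_frame are assumptions of this sketch, not details taken from the patent.

```python
import numpy as np

def split_events_by_frame(events: np.ndarray, frame_ts: np.ndarray):
    """Split an event stream into one event set per video frame.

    events:   array of shape (M, 4) with columns (t, x, y, p),
              sorted by timestamp t (layout assumed for this sketch).
    frame_ts: array of shape (N,) with the timestamp of each frame.
    Returns a list E = [e_1, ..., e_N]; e_i holds the events between
    frame i and frame i+1 (the last slice runs to the end of the stream).
    """
    # Events with t in [frame_ts[i], frame_ts[i+1]) belong to e_i.
    edges = np.searchsorted(events[:, 0], frame_ts)
    edges = np.append(edges, len(events))
    return [events[edges[i]:edges[i + 1]] for i in range(len(frame_ts))]

# Example: 3 frames and a handful of synthetic events.
frame_ts = np.array([0.00, 0.04, 0.08])
events = np.array([[0.01, 5, 7, 1],
                   [0.03, 6, 7, -1],
                   [0.05, 6, 8, 1],
                   [0.09, 7, 8, 1]])
E = split_events_by_frame(events, frame_ts)
print([len(e) for e in E])  # -> [2, 1, 1]
```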
step 2, constructing a video super-resolution neural network, which comprises: a pixel attention module, a sampling module, an event graph module, a feature interaction module and a decoding module;
step 2.1, a pixel attention module is constructed and used for carrying out feature extraction on image data:
the pixel attention module consists of 5 convolution layers and 1 sigmoid function, where the convolution kernel size in each convolution layer is ks and the stride is s;
the i-th frame low-resolution image x_i is input into the pixel attention module and processed by the 1st convolution layer to obtain the key key_i and the value value_i of the i-th frame low-resolution image x_i; key_i is processed by the 2nd convolution layer to obtain a dictionary key; meanwhile, key_i is processed by the 3rd convolution layer and the 1st sigmoid layer to obtain a query key, and the dictionary key and the query key are multiplied to obtain the association matrix A_{i,key};
value_i is passed through 1 convolution layer to obtain the processed value value'_i, which is multiplied by the association matrix A_{i,key} to obtain the image feature f_i^img of the i-th low-resolution image x_i, where C denotes the number of channels of f_i^img;
in this embodiment, the convolution kernel size is 3 × 3, the stride is 1, and C is 32;
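A minimal PyTorch sketch of a pixel attention block of the kind described in step 2.1 is shown below (5 convolution layers plus a sigmoid, 3 × 3 kernels, stride 1, C = 32). The class name, the exact wiring of key/query/value and the use of an element-wise product for the association matrix are assumptions made for illustration; they follow the text above but are not taken from the patent drawings.

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Sketch of the pixel attention module: 5 conv layers + 1 sigmoid."""
    def __init__(self, in_ch: int = 3, ch: int = 32):
        super().__init__()
        k, s, p = 3, 1, 1                              # 3x3 kernel, stride 1
        self.conv1 = nn.Conv2d(in_ch, 2 * ch, k, s, p) # stem -> key_i and value_i
        self.conv2 = nn.Conv2d(ch, ch, k, s, p)        # dictionary key
        self.conv3 = nn.Conv2d(ch, ch, k, s, p)        # query key (before sigmoid)
        self.conv4 = nn.Conv2d(ch, ch, k, s, p)        # value refinement -> value'_i
        self.conv5 = nn.Conv2d(ch, ch, k, s, p)        # output projection (assumed)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        key, value = self.conv1(x).chunk(2, dim=1)     # key_i, value_i
        dict_key = self.conv2(key)
        query_key = self.sigmoid(self.conv3(key))
        attention = dict_key * query_key               # association matrix A_{i,key}
        value = self.conv4(value)                      # value'_i
        return self.conv5(attention * value)           # image feature f_i^img (C channels)

feat = PixelAttention()(torch.randn(1, 3, 64, 64))
print(feat.shape)  # torch.Size([1, 32, 64, 64])
```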
step 2.2, inputting the event sequence e_i corresponding to the i-th frame low-resolution image into the sampling module for sampling; the specific structure of the sampling module is shown in fig. 3:
step 2.2.1, after 3D voxelization of the i-th event sequence e_i, dividing the voxels evenly into H blocks and averaging each block to obtain H sampled event key points, the h-th of which is denoted p_{i,h};
In this embodiment, H is 512;
step 2.2.2, setting the feature of the h-th event key point p_{i,h} as the event feature feat_{i,h}, and taking the region surrounding the h-th event key point p_{i,h} as its neighborhood N(p_{i,h});
step 2.2.3, updating the event feature feat_{i,h} corresponding to the h-th event key point p_{i,h} by formula (1), so as to obtain the updated event feature feat'_{i,h} of the h-th event key point:
feat'_{i,h} = Up(Conv(feat_{i,h}))   (1)
in formula (1), Conv denotes a local convolution operation and Up denotes an upsampling operation;
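The following sketch illustrates step 2.2 for one event sequence: 3D voxelization, even division into H = 512 blocks, per-block averaging to obtain sampled key points, and an update of the key-point features in the spirit of formula (1). The tensor layouts, the voxel grid size, and the use of a 1D convolution plus linear interpolation for Conv and Up are all assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_event_keypoints(events: torch.Tensor, grid=(8, 32, 32), H: int = 512):
    """Voxelize events (t, x, y, p) into a T x Y x X grid, split the flattened
    voxels evenly into H blocks and average each block -> H key-point features."""
    T, Ny, Nx = grid
    t = (events[:, 0] * (T - 1)).long().clamp(0, T - 1)
    x = (events[:, 1] * (Nx - 1)).long().clamp(0, Nx - 1)
    y = (events[:, 2] * (Ny - 1)).long().clamp(0, Ny - 1)
    vox = torch.zeros(T * Ny * Nx)
    vox.index_add_(0, t * Ny * Nx + y * Nx + x, events[:, 3])  # accumulate polarity
    return vox.view(H, -1).mean(dim=1)                         # H sampled key points

class KeypointUpdate(nn.Module):
    """feat'_{i,h} = Up(Conv(feat_{i,h})), formula (1), sketched with a 1D conv
    over the key-point axis followed by 2x linear upsampling."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)  # local convolution

    def forward(self, feat: torch.Tensor) -> torch.Tensor:     # feat: (H,)
        z = self.conv(feat.view(1, 1, -1))
        return F.interpolate(z, scale_factor=2, mode="linear").squeeze()

events = torch.rand(1000, 4)                 # normalized (t, x, y, p) for the demo
feat = sample_event_keypoints(events)        # (512,)
print(KeypointUpdate()(feat).shape)          # torch.Size([1024])
```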
step 2.3, the event graph module performs a stereo graph operation on the event features {feat'_{i,h}}_H of all event key points by formula (2) (available only as an image in the original publication), so as to obtain the global event feature f_i^evt of the i-th event sequence e_i; in formula (2), {feat'_{i,h}}_H denotes the event features corresponding to all event key points of the i-th event sequence e_i; Up denotes an upsampling operation and Conv denotes a convolution operation; feat'_{i,q} denotes the event feature of the q-th event key point; and a product function, whose definition (formula (3)) is likewise available only as an image, combines these key-point features;
the structure of the event graph module is shown in FIG. 4;
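Since formula (2) is available only as a drawing, the sketch below shows one plausible reading of the event graph module: each key point aggregates the features of its k nearest neighbours in the 3D event volume (a stereo-graph step), after which Conv and Up produce the global event feature. Everything here, including k and the mean aggregation, is an assumption for illustration rather than the patented formula.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EventGraphBlock(nn.Module):
    """One stereo-graph aggregation step over event key points (a sketch).

    coords: (H, 3) key-point coordinates in the 3D (t, x, y) event volume.
    feats:  (H, C) key-point features feat'_{i,h}.
    Each key point gathers its k nearest neighbours, mixes them with a
    shared linear layer, and the result is refined by Conv + upsampling.
    """
    def __init__(self, c: int = 32, k: int = 8):
        super().__init__()
        self.k = k
        self.mix = nn.Linear(c, c)
        self.conv = nn.Conv1d(c, c, kernel_size=3, padding=1)

    def forward(self, coords: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        dist = torch.cdist(coords, coords)                 # (H, H) pairwise distances
        idx = dist.topk(self.k, largest=False).indices     # k nearest neighbours
        neigh = feats[idx]                                 # (H, k, C)
        agg = self.mix(neigh).mean(dim=1)                  # aggregate the neighbourhood
        z = agg.t().unsqueeze(0)                           # (1, C, H)
        z = self.conv(z)                                   # enlarge the receptive field
        z = F.interpolate(z, scale_factor=2, mode="linear")
        return z.squeeze(0).t()                            # (2H, C) global event feature

coords, feats = torch.rand(512, 3), torch.randn(512, 32)
print(EventGraphBlock()(coords, feats).shape)              # torch.Size([1024, 32])
```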
step 2.4, constructing the feature interaction module and fusing the event features with the image features:
the feature interaction module comprises x downsampling layers and y weight-sharing convolution layers;
the image feature f_i^img and the global event feature f_i^evt are input into the feature interaction module; after the x downsampling layers and the y weight-sharing convolution layers, the g-th group of common features f_{i,g}^fuse is obtained by formula (4) (available only as an image in the original publication), thereby obtaining the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i; in formula (4), <·,·> denotes the inner product, f_{i,g}^evt denotes the g-th group of the global event feature f_i^evt, and G denotes the number of groups; in this embodiment, x is 1, y is 3, and G is 4;
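The grouped fusion of step 2.4 can be sketched as follows: the image and event features are split into G = 4 groups along the channel dimension and each pair of groups is fused with an inner product, matching the <·,·> notation of formula (4). The downsampling/convolution stack, the names, and the exact form of the fusion are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FeatureInteraction(nn.Module):
    """Sketch of the feature interaction module: x = 1 downsampling layer,
    y = 3 weight-sharing conv layers, then a grouped inner-product fusion."""
    def __init__(self, c: int = 32, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.down = nn.Conv2d(c, c, kernel_size=3, stride=2, padding=1)  # x = 1
        self.shared = nn.Conv2d(c, c, kernel_size=3, padding=1)          # y = 3 (reused)

    def _encode(self, f: torch.Tensor) -> torch.Tensor:
        f = self.down(f)
        for _ in range(3):                      # the same (weight-sharing) conv, 3 times
            f = torch.relu(self.shared(f))
        return f

    def forward(self, f_img: torch.Tensor, f_evt: torch.Tensor) -> torch.Tensor:
        a = self._encode(f_img).chunk(self.groups, dim=1)   # G groups of image feature
        b = self._encode(f_evt).chunk(self.groups, dim=1)   # G groups of event feature
        fused = [(ai * bi).sum(dim=1, keepdim=True) for ai, bi in zip(a, b)]  # <.,.>
        return torch.cat(fused, dim=1)          # (B, G, H/2, W/2) common features

f_img, f_evt = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)
print(FeatureInteraction()(f_img, f_evt).shape)   # torch.Size([1, 4, 32, 32])
```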
step 2.5, using the feature interaction module to output the final high-resolution image:
step 2.5.1, defining the iteration number as p and initializing p to 1; defining the maximum iteration number as P; taking the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i as the input data of the p-th iteration; in this embodiment, P is 8;
step 2.5.2, inputting the input data of the p-th iteration into the pixel attention module and the event graph module respectively for processing, so as to obtain the global event feature and the image feature of the p-th iteration;
step 2.5.3, processing the global event feature and the image feature of the p-th iteration with the feature interaction module, so as to obtain the input data of the (p+1)-th iteration;
step 2.5.4, after assigning p+1 to p, judging whether p > P holds; if so, inputting the input data of the (p+1)-th iteration into the decoding module for processing, so as to obtain the j-th frame high-resolution predicted image ŷ_j; otherwise, returning to step 2.5.2;
the specific structure of the feature interaction module is shown in fig. 5.
step 3, constructing the back-propagation loss function L_MSE by formula (5):
L_MSE = (1/R) · Σ_{r=1}^{R} ( ŷ_j^r - y_j^r )²   (5)
in formula (5), R is the number of pixels of the j-th frame high-resolution predicted image ŷ_j; ŷ_j^r is the r-th pixel of the high-resolution image generated by the neural network from the i-th low-resolution image x_i; y_j^r is the corresponding r-th pixel of the high-resolution image y_j in the high-resolution video image set Y;
step 4, training the video super-resolution neural network on the low-resolution image set X and its event sequences E, computing the loss function L_MSE and updating the network weights with learning rate lr_s; in this embodiment, the learning rate lr_s is 5e-5. Training is stopped when the number of training iterations reaches a set number or the loss error is smaller than a set threshold, so as to obtain the optimal super-resolution model; the optimal super-resolution network is then used to process a low-resolution video image and its corresponding event sequence, so as to obtain the corresponding high-resolution video image.
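A minimal training loop matching step 3 and step 4 (MSE loss, learning rate 5e-5, stopping on an iteration budget or a loss threshold) might look like the following; the optimizer (Adam), the model and the data loader are placeholders assumed for this sketch.

```python
import torch
import torch.nn as nn

def train(model, loader, max_iters: int = 100_000, loss_thresh: float = 1e-4):
    """Train with the back-propagated MSE loss of formula (5)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)   # lr_s = 5e-5
    criterion = nn.MSELoss()                                    # L_MSE
    it = 0
    while it < max_iters:
        for lr_frame, events, hr_frame in loader:
            pred = model(lr_frame, events)       # predicted HR frame
            loss = criterion(pred, hr_frame)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1
            if it >= max_iters or loss.item() < loss_thresh:
                return model
    return model

# Placeholder model and a single synthetic batch, just to show the call shape.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1),
                                nn.Upsample(scale_factor=2, mode="bilinear"))
    def forward(self, frame, events):
        return self.up(frame)

loader = [(torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32),
           torch.randn(2, 3, 64, 64))]
train(Toy(), loader, max_iters=1)
```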
Examples
In order to verify the effectiveness of the method, a common synthetic data set and a common real data set are selected for training and testing.
Since the proposed SGM network is trained end to end, a dataset containing LR intensity frames, corresponding event streams and HR intensity frames is required. In this embodiment, the network is trained on a synthetic dataset whose event data are generated with v2e. The high-frame-rate, high-resolution NFS and GoPro datasets are used as input sources, so high-resolution intensity images are readily available. To simulate real APS frames, the video frames are reduced to 128 × 128 in this embodiment and V2E is used to generate the LR event stream. The corresponding HR intensity frames are simply down-sampled according to the training scale factor (×2 or ×4). The synthetic dataset comprises 3828 data tuples generated from 132 video sequences. To improve the generalization of the network to real event data, the positive and negative contrast thresholds used during event generation are randomly sampled from a normal distribution with mean 0.15 and standard deviation 0.03. For testing, both a synthetic and a real dataset are used to evaluate the method. The synthetic test dataset consists of 841 intensity images and the corresponding simulated event streams between consecutive frames from 19 videos. For the real-world test dataset, the HQF dataset is selected; it consists of real events and low-resolution frames of different outdoor and indoor scenes captured by a DAVIS346 camera.
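The data preparation described above (downsampling frames to 128 × 128 for event simulation, downsampling according to the scale factor, and drawing the positive/negative contrast thresholds from N(0.15, 0.03)) can be sketched as follows. The event simulation itself is delegated to v2e, whose invocation is only indicated by a comment because its exact interface is not specified here; all function and variable names are assumptions.

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)

def prepare_pair(hr_frame: np.ndarray, scale: int = 2):
    """Build one (LR frame, simulation frame, thresholds) tuple for training.

    hr_frame: H x W x 3 uint8 ground-truth frame from NFS / GoPro.
    Returns the LR intensity frame (HR downsampled by `scale`), the 128 x 128
    frame that would be fed to the v2e event simulator, and the randomly drawn
    positive/negative contrast thresholds.
    """
    h, w = hr_frame.shape[:2]
    lr_frame = cv2.resize(hr_frame, (w // scale, h // scale), interpolation=cv2.INTER_AREA)
    sim_frame = cv2.resize(hr_frame, (128, 128), interpolation=cv2.INTER_AREA)
    pos_thres, neg_thres = rng.normal(0.15, 0.03, size=2)   # contrast thresholds
    # The 128 x 128 frame sequence would then be passed to v2e to synthesize
    # the LR event stream with these thresholds (v2e call not shown).
    return lr_frame, sim_frame, (float(pos_thres), float(neg_thres))

hr = (rng.random((256, 256, 3)) * 255).astype(np.uint8)
lr, sim, thres = prepare_pair(hr, scale=2)
print(lr.shape, sim.shape, thres)
```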
Peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and learned perceptual image patch similarity (LPIPS) are used as evaluation indexes.
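For reference, the three evaluation indexes can be computed with standard libraries as in the snippet below (scikit-image for PSNR/SSIM and the lpips package for LPIPS); this is a generic evaluation sketch, not code from the patent.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")   # perceptual metric network (downloads weights once)

def evaluate(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: H x W x 3 float images in [0, 1]; returns (PSNR, SSIM, LPIPS)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=1.0)
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()   # LPIPS expects [-1, 1] tensors
    return psnr, ssim, lp

pred = np.random.rand(64, 64, 3).astype(np.float32)
gt = np.clip(pred + 0.01 * np.random.randn(64, 64, 3), 0, 1).astype(np.float32)
print(evaluate(pred, gt))
```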
In this embodiment, six methods are selected for comparison with the proposed method (SGM-Net): DPT, E2SRI, DCSR, EvIntSR, eSL-Net and SPADE.
The experimental results are shown in Tables 1 and 2:
TABLE 1: Experimental results for ×2 SR using the method of the present invention (SGM-Net) and the six selected comparative methods

Methods | DPT | E2SRI | DCSR | EvIntSR | eSL-Net | SPADE | SGM-Net
PSNR | 26.10 | 23.05 | 25.06 | 23.13 | 28.41 | 23.89 | 30.77
SSIM | 0.874 | 0.784 | 0.804 | 0.776 | 0.880 | 0.773 | 0.913
LPIPS | 0.107 | 0.192 | 0.121 | 0.151 | 0.092 | 0.139 | 0.063
TABLE 2: Experimental results for ×4 SR using the method of the present invention (SGM-Net) and the six selected comparative methods

Methods | DPT | E2SRI | DCSR | EvIntSR | eSL-Net | SPADE | SGM-Net
PSNR | 25.76 | 21.06 | 19.51 | 23.25 | 26.80 | 21.11 | 28.40
SSIM | 0.841 | 0.729 | 0.688 | 0.745 | 0.869 | 0.701 | 0.897
LPIPS | 0.088 | 0.192 | 0.229 | 0.149 | 0.099 | 0.191 | 0.082
As can be seen from Tables 1 and 2, for both ×2 and ×4 super-resolution on the same dataset the method of the present invention performs better than the other six methods, which demonstrates the feasibility of the proposed method. The experiments show that the proposed method can make full use of the prior information provided by event data to accomplish the video super-resolution task.

Claims (1)

1. An event-driven video super-resolution method based on stereogram modeling is characterized by comprising the following steps:
step 1, acquiring training video data and a corresponding event sequence, and segmenting the event sequence:
step 1.1, acquiring a real low-resolution video image set, storing it frame by frame and recording it as X = {x_1, x_2, ..., x_i, ..., x_N}, where x_i denotes the i-th frame low-resolution image, i = 1, 2, ..., N, and N is the number of low-resolution frames; acquiring the event sequence corresponding to the low-resolution video image set X and dividing it, according to the frame number N, into a corresponding number of event sequences recorded as E = {e_1, e_2, ..., e_i, ..., e_N}, where e_i denotes the event sequence corresponding to the i-th frame low-resolution image;
acquiring a high-resolution video image set, storing it frame by frame and recording it as Y = {y_1, y_2, ..., y_j, ..., y_N}, where y_j denotes the j-th frame sharp (high-resolution) image;
letting I = {X, E, Y} denote the training image dataset;
step 2, constructing a video super-resolution neural network, which comprises: a pixel attention module, a sampling module, an event graph module, a feature interaction module and a decoding module;
step 2.1, a pixel attention module is constructed and used for carrying out feature extraction on image data:
the pixel attention module consists of 5 convolution layers and 1 sigmoid function, where the convolution kernel size in each convolution layer is ks and the stride is s;
the i-th frame low-resolution image x_i is input into the pixel attention module and processed by the 1st convolution layer to obtain the key key_i and the value value_i of the i-th frame low-resolution image x_i; key_i is processed by the 2nd convolution layer to obtain a dictionary key; meanwhile, key_i is processed by the 3rd convolution layer and the 1st sigmoid layer to obtain a query key, and the dictionary key and the query key are multiplied to obtain the association matrix A_{i,key};
value_i is passed through 1 convolution layer to obtain the processed value value'_i, which is multiplied by the association matrix A_{i,key} to obtain the image feature f_i^img of the i-th low-resolution image x_i, where C denotes the number of channels of f_i^img;
step 2.2, inputting the event sequence e_i corresponding to the i-th frame low-resolution image into the sampling module for sampling:
step 2.2.1, after 3D voxelization of the i-th event sequence e_i, dividing the voxels evenly into H blocks and averaging each block to obtain H sampled event key points, the h-th of which is denoted p_{i,h};
Step 2.2.2 Key Point for h event
Figure FDA0003544972540000013
Is set as an event feature flati,hAt the h-th event key point
Figure FDA0003544972540000014
The region outside as its neighborhood
Figure FDA0003544972540000015
step 2.2.3, updating the event feature feat_{i,h} corresponding to the h-th event key point p_{i,h} by formula (1), so as to obtain the updated event feature feat'_{i,h} of the h-th event key point:
feat'_{i,h} = Up(Conv(feat_{i,h}))   (1)
in formula (1), Conv denotes a local convolution operation and Up denotes an upsampling operation;
step 2.3, the event graph module performs a stereo graph operation on the event features {feat'_{i,h}}_H of all event key points by formula (2) (available only as an image in the original publication), so as to obtain the global event feature f_i^evt of the i-th event sequence e_i; in formula (2), {feat'_{i,h}}_H denotes the event features corresponding to all event key points of the i-th event sequence e_i; Up denotes an upsampling operation and Conv denotes a convolution operation; feat'_{i,q} denotes the event feature of the q-th event key point; and a product function, whose definition (formula (3)) is likewise available only as an image, combines these key-point features;
step 2.4, constructing the feature interaction module and fusing the event features with the image features:
the feature interaction module comprises x downsampling layers and y weight-sharing convolution layers;
the image feature f_i^img and the global event feature f_i^evt are input into the feature interaction module; after the x downsampling layers and the y weight-sharing convolution layers, the g-th group of common features f_{i,g}^fuse is obtained by formula (4) (available only as an image in the original publication), thereby obtaining the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i; in formula (4), <·,·> denotes the inner product, f_{i,g}^evt denotes the g-th group of the global event feature f_i^evt, and G denotes the number of groups;
step 2.5, using the feature interaction module to output the final high-resolution image:
step 2.5.1, defining the iteration number as p and initializing p to 1; defining the maximum iteration number as P; taking the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i as the input data of the p-th iteration;
step 2.5.2, inputting the input data of the p-th iteration into the pixel attention module and the event graph module respectively for processing, so as to obtain the global event feature and the image feature of the p-th iteration;
step 2.5.3, processing the global event feature and the image feature of the p-th iteration with the feature interaction module, so as to obtain the input data of the (p+1)-th iteration;
step 2.5.4, after assigning p+1 to p, judging whether p > P holds; if so, inputting the input data of the (p+1)-th iteration into the decoding module for processing, so as to obtain the j-th frame high-resolution predicted image ŷ_j; otherwise, returning to step 2.5.2;
step 3, constructing the back-propagation loss function L_MSE by formula (5):
L_MSE = (1/R) · Σ_{r=1}^{R} ( ŷ_j^r - y_j^r )²   (5)
in formula (5), R is the number of pixels of the j-th frame high-resolution predicted image ŷ_j; ŷ_j^r is the r-th pixel of the high-resolution image generated by the neural network from the i-th low-resolution image x_i; y_j^r is the corresponding r-th pixel of the high-resolution image y_j in the high-resolution video image set Y;
step 4, training the video super-resolution neural network on the low-resolution image set X and its event sequences E, computing the loss function L_MSE and updating the network weights with learning rate lr_s; training is stopped when the number of training iterations reaches a set number or the loss error is smaller than a set threshold, so as to obtain the optimal super-resolution model; the optimal super-resolution network is then used to process a low-resolution video image and its corresponding event sequence, so as to obtain the corresponding high-resolution video image.
CN202210245281.1A 2022-03-14 2022-03-14 Event-driven video super-resolution method based on stereogram modeling Active CN114612305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210245281.1A CN114612305B (en) 2022-03-14 2022-03-14 Event-driven video super-resolution method based on stereogram modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210245281.1A CN114612305B (en) 2022-03-14 2022-03-14 Event-driven video super-resolution method based on stereogram modeling

Publications (2)

Publication Number Publication Date
CN114612305A (en) 2022-06-10
CN114612305B (en) 2024-04-02

Family

ID=81863410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210245281.1A Active CN114612305B (en) 2022-03-14 2022-03-14 Event-driven video super-resolution method based on stereogram modeling

Country Status (1)

Country Link
CN (1) CN114612305B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2711892A2 (en) * 2012-09-24 2014-03-26 Vision Semantics Limited Improvements in resolving video content
US20210209731A1 (en) * 2020-01-03 2021-07-08 Beijing Baidu Netcom Science And Technology Co., Ltd. Video processing method, apparatus, device and storage medium
CN111667442A (en) * 2020-05-21 2020-09-15 武汉大学 High-quality high-frame-rate image reconstruction method based on event camera
CN113610707A (en) * 2021-07-23 2021-11-05 广东工业大学 Video super-resolution method based on time attention and cyclic feedback network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Na; Li Cuihua: "Single-frame image super-resolution reconstruction method based on multi-layer convolutional neural network learning", China Sciencepaper, No. 02
Liu Cun; Li Yuanxiang; Zhou Yongjun; Luo Jianhua: "Video image super-resolution reconstruction method based on convolutional neural network", Application Research of Computers, No. 04

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862122A (en) * 2022-12-27 2023-03-28 北京衔微医疗科技有限公司 Fundus image acquisition method, fundus image acquisition device, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN114612305B (en) 2024-04-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant