CN114612305A - Event-driven video super-resolution method based on stereogram modeling - Google Patents


Publication number
CN114612305A
Authority
CN
China
Prior art keywords
event
resolution
image
module
ith
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210245281.1A
Other languages
Chinese (zh)
Other versions
CN114612305B (en)
Inventor
查正军
傅雪阳
曹成志
时格格
朱禹睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN202210245281.1A priority Critical patent/CN114612305B/en
Publication of CN114612305A publication Critical patent/CN114612305A/en
Application granted granted Critical
Publication of CN114612305B publication Critical patent/CN114612305B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T 3/4053 Super resolution, i.e. output image resolution higher than sensor resolution
    • G06F 18/253 Fusion techniques of extracted features (pattern recognition)
    • G06N 3/045 Combinations of networks (neural network architectures)
    • G06N 3/084 Backpropagation, e.g. using gradient descent (neural network learning methods)
    • G06T 3/4046 Scaling the whole image or part thereof using neural networks

Abstract

The invention discloses an event-driven video super-resolution method based on stereogram modeling, which comprises the following steps: 1. acquiring video data and the corresponding event sequence, and segmenting the event sequence; 2. constructing a pixel attention module to extract features from the images; 3. re-sampling the neighborhood of each initial event through a sampling module and repeatedly adjusting the features of the sampled events; 4. performing stereo graph modeling on each sampled event and its neighborhood through an event graph module, so as to aggregate local features within the neighborhood and gradually enlarge the receptive field over the whole stereo event stream; 5. enabling the event features and the image features to interact through the feature interaction module. The method makes full use of the prior information provided by event data to drive video super-resolution, thereby effectively improving the super-resolution effect.

Description

Event-driven video super-resolution method based on stereogram modeling
Technical Field
The invention relates to the field of video super-resolution, in particular to an event-driven video super-resolution method based on stereogram modeling.
Background
Video, as an important data source in computer vision, inevitably suffers from blur caused by object motion, which degrades the subjective quality of experience and hinders further applications. Video deblurring has therefore attracted considerable attention. Because a significant amount of motion information is lost during the blurring process, recovering sharp video sequences from motion-blurred images alone is difficult. Recently, a new sensor called an event camera has been proposed that records scene intensity changes on the order of microseconds; fast motion can be captured as events at a high temporal rate, providing new opportunities for exploring solutions to video deblurring.
Video, as an important data source in computer visual communication, inevitably suffers from reduced image quality due to various external factors, which affects both subjective quality and the breadth of applications. In order to improve image definition, video super-resolution has attracted much attention. Recently, a new type of sensor called an event camera has been proposed for recording and capturing microsecond-level scene intensity variations. For event cameras, fast motion can be captured as high-temporal-rate events, which provides new opportunities for exploring video super-resolution solutions.
The event camera is a new type of sensor that asynchronously records brightness changes with microsecond precision, emitting positive or negative signals when the change exceeds predefined thresholds. Owing to its low latency, low cost and high temporal resolution, the event camera can serve many applications, such as video frame interpolation, super-resolution, deblurring, intensity image reconstruction and event stream denoising. However, most commercial event cameras produce relatively low-resolution streams in order to remain efficient. Because high-resolution images/videos benefit computer vision tasks such as gesture recognition, target tracking and classification, more and more researchers have turned to event-guided super-resolution of intensity frames and have achieved notable academic results.
Generally, event-based super-resolution methods can be classified into three categories: (1) cascading a network that converts events into intensity images with a super-resolution algorithm; (2) constructing an HR intensity image by directly super-resolving the event stream without intensity assistance; (3) taking mixed signals (e.g., APS frames and events) as input to increase the spatial resolution of the intensity image. However, these pipeline methods typically compress the event stream into event frames that have the same channels and scale as the video frames. This strategy leaves the spatial correlation in the stereo event stream under-utilized and produces unrealistic results. In addition, these methods process frames and events with the same feature-extraction procedure, ignoring the important distinction between sparse event streams and dense video frames. As a result, the sparse features in the events cannot be exploited correctly, since the shared weights and local receptive fields of convolutional layers distort most regions. These problems limit further development of event-based video super-resolution research.
Disclosure of Invention
In order to overcome the defects of existing methods, the invention provides an event-driven video super-resolution method based on stereogram modeling, so that better super-resolution performance can be achieved in video super-resolution tasks across different scenes.
In order to solve the technical problems, the invention adopts the following technical scheme:
the event-driven video super-resolution method based on stereogram modeling is characterized by comprising the following steps of:
step 1, acquiring training video data and a corresponding event sequence, and segmenting the event sequence:
step 1.1, acquiring a real low-resolution video image set, storing it frame by frame and recording it as X = {x_1, x_2, ..., x_i, ..., x_N}, where x_i denotes the i-th frame low-resolution image, i = 1, 2, ..., N, and N is the number of low-resolution frames; acquiring the event sequence corresponding to the low-resolution video image set X and dividing it, according to the frame number N, into a corresponding number of event sequences recorded as E = {e_1, e_2, ..., e_i, ..., e_N}, where e_i denotes the event sequence corresponding to the i-th frame low-resolution image;
acquiring a high-resolution video image set, storing it frame by frame and recording it as Y = {y_1, y_2, ..., y_j, ..., y_N}, where y_j denotes the j-th frame sharp (high-resolution) image;
letting I = {X, E, Y} denote the training image dataset;
step 2, constructing a video super-resolution neural network, which comprises: a pixel attention module, a sampling module, an event graph module, a feature interaction module and a decoding module;
step 2.1, a pixel attention module is constructed and used for carrying out feature extraction on image data:
the pixel attention module consists of 5 convolution layers and 1 sigmoid function, where the convolution kernel size in each convolution layer is ks and the stride is s;
the i-th frame low-resolution image x_i is input into the pixel attention module and processed by the 1st convolution layer to obtain the key key_i and the value value_i of the i-th frame low-resolution image x_i; key_i is processed by the 2nd convolution layer to obtain a dictionary key; meanwhile, key_i is processed by the 3rd convolution layer and the 1st sigmoid layer to obtain a query key, and the dictionary key and the query key are multiplied to obtain the association matrix A_{i,key};
value_i is passed through 1 convolution layer to obtain the processed value value'_i, which is multiplied by the association matrix A_{i,key} to obtain the image feature f_i^img of the i-th low-resolution image x_i, where C denotes the number of channels of f_i^img;
step 2.2, inputting the event sequence e_i corresponding to the i-th frame low-resolution image into the sampling module for sampling:
step 2.2.1, after 3D voxelization of the i-th event sequence e_i, dividing the voxels evenly into H blocks and averaging each block to obtain H sampled event key points, the h-th of which is denoted p_{i,h};
step 2.2.2, setting the feature of the h-th event key point p_{i,h} as the event feature feat_{i,h}, and taking the region surrounding the h-th event key point p_{i,h} as its neighborhood N(p_{i,h});
step 2.2.3, updating the event feature feat_{i,h} corresponding to the h-th event key point p_{i,h} by formula (1), so as to obtain the updated event feature feat'_{i,h} of the h-th event key point:
feat'_{i,h} = Up(Conv(feat_{i,h}))   (1)
in formula (1), Conv denotes a local convolution operation and Up denotes an upsampling operation;
step 2.3, the event graph module performs a stereo graph operation on the event features {feat'_{i,h}}_H of all event key points by formula (2) (available only as an image in the original publication), so as to obtain the global event feature f_i^evt of the i-th event sequence e_i; in formula (2), {feat'_{i,h}}_H denotes the event features corresponding to all event key points of the i-th event sequence e_i; Up denotes an upsampling operation and Conv denotes a convolution operation; feat'_{i,q} denotes the event feature of the q-th event key point; and a product function, whose definition (formula (3)) is likewise available only as an image, combines these key-point features;
step 2.4, constructing the feature interaction module and fusing the event features with the image features:
the feature interaction module comprises x downsampling layers and y weight-sharing convolution layers;
the image feature f_i^img and the global event feature f_i^evt are input into the feature interaction module; after the x downsampling layers and the y weight-sharing convolution layers, the g-th group of common features f_{i,g}^fuse is obtained by formula (4) (available only as an image in the original publication), thereby obtaining the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i; in formula (4), <·,·> denotes the inner product, f_{i,g}^evt denotes the g-th group of the global event feature f_i^evt, and G denotes the number of groups;
step 2.5, using the feature interaction module to output the final high-resolution image:
step 2.5.1, defining the iteration number as p and initializing p to 1; defining the maximum iteration number as P; taking the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i as the input data of the p-th iteration;
step 2.5.2, inputting the input data of the p-th iteration into the pixel attention module and the event graph module respectively for processing, so as to obtain the global event feature and the image feature of the p-th iteration;
step 2.5.3, processing the global event feature and the image feature of the p-th iteration with the feature interaction module, so as to obtain the input data of the (p+1)-th iteration;
step 2.5.4, after assigning p+1 to p, judging whether p > P holds; if so, inputting the input data of the (p+1)-th iteration into the decoding module for processing, so as to obtain the j-th frame high-resolution predicted image ŷ_j; otherwise, returning to step 2.5.2;
step 3, constructing the back-propagation loss function L_MSE by formula (5):
L_MSE = (1/R) · Σ_{r=1}^{R} ( ŷ_j^r - y_j^r )²   (5)
in formula (5), R is the number of pixels of the j-th frame high-resolution predicted image ŷ_j; ŷ_j^r is the r-th pixel of the high-resolution image generated by the neural network from the i-th low-resolution image x_i; y_j^r is the corresponding r-th pixel of the high-resolution image y_j in the high-resolution video image set Y;
step 4, training the video super-resolution neural network on the low-resolution image set X and its event sequences E, computing the loss function L_MSE and updating the network weights with learning rate lr_s; training is stopped when the number of training iterations reaches a set number or the loss error is smaller than a set threshold, so as to obtain the optimal super-resolution model; the optimal super-resolution network is then used to process a low-resolution video image and its corresponding event sequence, so as to obtain the corresponding high-resolution video image.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention uses event data to drive the video super-resolution task and achieves a good end-to-end super-resolution effect with a small number of parameters; compared with prior methods it reduces the parameter count and shows better robustness across different datasets. Experimental results show that the proposed method outperforms state-of-the-art methods on both synthetic and real datasets.
2. The invention perceives the temporal correlation between adjacent event sequences through a stereo event feature mechanism, exploiting the high-temporal-resolution information provided by events to adjust the coordinates of sampled events, capture the correlation of neighboring regions, and perceive long-term correlations in the whole event stream.
3. A temporal memory module is used to compute the long-term correlation between different events so as to recover the correlation of temporal events, and the final network is built from these two blocks and trained end to end. The similarity between the query and the key measures the temporal non-local correspondence to the current event and generates the corresponding value to perceive temporal change; a correlation matrix between the event at time T and the adjacent event sequences is obtained through a product operation and used to fuse the event features, so that the temporal information between different event sequences is continuously recovered.
4. The invention uses a feature interaction module to fuse image features and event features. The additional information in the events provides more detail for frame super-resolution, and the frame features can iteratively and adaptively fine-tune the event features. By modeling the global relations across space and channels, the global information of the input features is mined in depth, which improves the super-resolution performance and increases the interpretability of the model.
Drawings
FIG. 1 is a diagram of a super-resolution method of event-driven video based on stereogram modeling according to the present invention;
FIG. 2 is a block diagram of a sampling module according to the present invention;
FIG. 3 is a block diagram of an event graph module according to the present invention;
FIG. 4 is a block diagram of a feature interaction module of the present invention;
FIG. 5 is a flow chart of the inventive method.
Detailed Description
In this embodiment, the specific flow of the event-driven video super-resolution method based on stereogram modeling is shown in fig. 1. The method comprehensively considers the characteristics of event data and video sequences and makes full use of the prior information provided by the event data to drive video super-resolution, so that the super-resolution effect can be effectively improved; the algorithm structure of the whole method is shown in fig. 2. Specifically, the method comprises the following steps:
step 1, acquiring training video data and a corresponding event sequence, and segmenting the event sequence:
step 1.1, acquiring a real low-resolution video image set, storing it frame by frame and recording it as X = {x_1, x_2, ..., x_i, ..., x_N}, where x_i denotes the i-th frame low-resolution image, i = 1, 2, ..., N, and N is the number of low-resolution frames; acquiring the event sequence corresponding to the low-resolution video image set X and dividing it, according to the frame number N, into a corresponding number of event sequences recorded as E = {e_1, e_2, ..., e_i, ..., e_N}, where e_i denotes the event sequence corresponding to the i-th frame low-resolution image;
acquiring a high-resolution video image set, storing it frame by frame and recording it as Y = {y_1, y_2, ..., y_j, ..., y_N}, where y_j denotes the j-th frame sharp (high-resolution) image;
letting I = {X, E, Y} denote the training image dataset;
in this embodiment, the NFS dataset is used to train and evaluate the model; it contains 100 video sequences of different scenes, 80 of which are selected for training the model and the rest for evaluation;
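For illustration of step 1, a minimal sketch of how an event stream might be split into per-frame event sequences e_1, ..., e_N by frame timestamps is given below; the (t, x, y, p) array layout and the function name split_events_by_frame are assumptions of this sketch, not details taken from the patent.

```python
import numpy as np

def split_events_by_frame(events: np.ndarray, frame_ts: np.ndarray):
    """Split an event stream into one event set per video frame.

    events:   array of shape (M, 4) with columns (t, x, y, p),
              sorted by timestamp t (layout assumed for this sketch).
    frame_ts: array of shape (N,) with the timestamp of each frame.
    Returns a list E = [e_1, ..., e_N]; e_i holds the events between
    frame i and frame i+1 (the last slice runs to the end of the stream).
    """
    # Events with t in [frame_ts[i], frame_ts[i+1]) belong to e_i.
    edges = np.searchsorted(events[:, 0], frame_ts)
    edges = np.append(edges, len(events))
    return [events[edges[i]:edges[i + 1]] for i in range(len(frame_ts))]

# Example: 3 frames and a handful of synthetic events.
frame_ts = np.array([0.00, 0.04, 0.08])
events = np.array([[0.01, 5, 7, 1],
                   [0.03, 6, 7, -1],
                   [0.05, 6, 8, 1],
                   [0.09, 7, 8, 1]])
E = split_events_by_frame(events, frame_ts)
print([len(e) for e in E])  # -> [2, 1, 1]
```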
step 2, constructing a video super-resolution neural network, which comprises: a pixel attention module, a sampling module, an event graph module, a feature interaction module and a decoding module;
step 2.1, a pixel attention module is constructed and used for carrying out feature extraction on image data:
the pixel attention module consists of 5 convolution layers and 1 sigmoid function, where the convolution kernel size in each convolution layer is ks and the stride is s;
the i-th frame low-resolution image x_i is input into the pixel attention module and processed by the 1st convolution layer to obtain the key key_i and the value value_i of the i-th frame low-resolution image x_i; key_i is processed by the 2nd convolution layer to obtain a dictionary key; meanwhile, key_i is processed by the 3rd convolution layer and the 1st sigmoid layer to obtain a query key, and the dictionary key and the query key are multiplied to obtain the association matrix A_{i,key};
value_i is passed through 1 convolution layer to obtain the processed value value'_i, which is multiplied by the association matrix A_{i,key} to obtain the image feature f_i^img of the i-th low-resolution image x_i, where C denotes the number of channels of f_i^img;
in this embodiment, the convolution kernel size is 3 × 3, the stride is 1, and C is 32;
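A minimal PyTorch sketch of a pixel attention block of the kind described in step 2.1 is shown below (5 convolution layers plus a sigmoid, 3 × 3 kernels, stride 1, C = 32). The class name, the exact wiring of key/query/value and the use of an element-wise product for the association matrix are assumptions made for illustration; they follow the text above but are not taken from the patent drawings.

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """Sketch of the pixel attention module: 5 conv layers + 1 sigmoid."""
    def __init__(self, in_ch: int = 3, ch: int = 32):
        super().__init__()
        k, s, p = 3, 1, 1                              # 3x3 kernel, stride 1
        self.conv1 = nn.Conv2d(in_ch, 2 * ch, k, s, p) # stem -> key_i and value_i
        self.conv2 = nn.Conv2d(ch, ch, k, s, p)        # dictionary key
        self.conv3 = nn.Conv2d(ch, ch, k, s, p)        # query key (before sigmoid)
        self.conv4 = nn.Conv2d(ch, ch, k, s, p)        # value refinement -> value'_i
        self.conv5 = nn.Conv2d(ch, ch, k, s, p)        # output projection (assumed)
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        key, value = self.conv1(x).chunk(2, dim=1)     # key_i, value_i
        dict_key = self.conv2(key)
        query_key = self.sigmoid(self.conv3(key))
        attention = dict_key * query_key               # association matrix A_{i,key}
        value = self.conv4(value)                      # value'_i
        return self.conv5(attention * value)           # image feature f_i^img (C channels)

feat = PixelAttention()(torch.randn(1, 3, 64, 64))
print(feat.shape)  # torch.Size([1, 32, 64, 64])
```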
step 2.2, inputting the event sequence e_i corresponding to the i-th frame low-resolution image into the sampling module for sampling; the specific structure of the sampling module is shown in fig. 3:
step 2.2.1, after 3D voxelization of the i-th event sequence e_i, dividing the voxels evenly into H blocks and averaging each block to obtain H sampled event key points, the h-th of which is denoted p_{i,h};
In this embodiment, H is 512;
step 2.2.2, setting the feature of the h-th event key point p_{i,h} as the event feature feat_{i,h}, and taking the region surrounding the h-th event key point p_{i,h} as its neighborhood N(p_{i,h});
step 2.2.3, updating the event feature feat_{i,h} corresponding to the h-th event key point p_{i,h} by formula (1), so as to obtain the updated event feature feat'_{i,h} of the h-th event key point:
feat'_{i,h} = Up(Conv(feat_{i,h}))   (1)
in formula (1), Conv denotes a local convolution operation and Up denotes an upsampling operation;
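The following sketch illustrates step 2.2 for one event sequence: 3D voxelization, even division into H = 512 blocks, per-block averaging to obtain sampled key points, and an update of the key-point features in the spirit of formula (1). The tensor layouts, the voxel grid size, and the use of a 1D convolution plus linear interpolation for Conv and Up are all assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_event_keypoints(events: torch.Tensor, grid=(8, 32, 32), H: int = 512):
    """Voxelize events (t, x, y, p) into a T x Y x X grid, split the flattened
    voxels evenly into H blocks and average each block -> H key-point features."""
    T, Ny, Nx = grid
    t = (events[:, 0] * (T - 1)).long().clamp(0, T - 1)
    x = (events[:, 1] * (Nx - 1)).long().clamp(0, Nx - 1)
    y = (events[:, 2] * (Ny - 1)).long().clamp(0, Ny - 1)
    vox = torch.zeros(T * Ny * Nx)
    vox.index_add_(0, t * Ny * Nx + y * Nx + x, events[:, 3])  # accumulate polarity
    return vox.view(H, -1).mean(dim=1)                         # H sampled key points

class KeypointUpdate(nn.Module):
    """feat'_{i,h} = Up(Conv(feat_{i,h})), formula (1), sketched with a 1D conv
    over the key-point axis followed by 2x linear upsampling."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=3, padding=1)  # local convolution

    def forward(self, feat: torch.Tensor) -> torch.Tensor:     # feat: (H,)
        z = self.conv(feat.view(1, 1, -1))
        return F.interpolate(z, scale_factor=2, mode="linear").squeeze()

events = torch.rand(1000, 4)                 # normalized (t, x, y, p) for the demo
feat = sample_event_keypoints(events)        # (512,)
print(KeypointUpdate()(feat).shape)          # torch.Size([1024])
```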
step 2.3, the event graph module performs a stereo graph operation on the event features {feat'_{i,h}}_H of all event key points by formula (2) (available only as an image in the original publication), so as to obtain the global event feature f_i^evt of the i-th event sequence e_i; in formula (2), {feat'_{i,h}}_H denotes the event features corresponding to all event key points of the i-th event sequence e_i; Up denotes an upsampling operation and Conv denotes a convolution operation; feat'_{i,q} denotes the event feature of the q-th event key point; and a product function, whose definition (formula (3)) is likewise available only as an image, combines these key-point features;
the structure of the event graph module is shown in FIG. 4;
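Since formula (2) is available only as a drawing, the sketch below shows one plausible reading of the event graph module: each key point aggregates the features of its k nearest neighbours in the 3D event volume (a stereo-graph step), after which Conv and Up produce the global event feature. Everything here, including k and the mean aggregation, is an assumption for illustration rather than the patented formula.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EventGraphBlock(nn.Module):
    """One stereo-graph aggregation step over event key points (a sketch).

    coords: (H, 3) key-point coordinates in the 3D (t, x, y) event volume.
    feats:  (H, C) key-point features feat'_{i,h}.
    Each key point gathers its k nearest neighbours, mixes them with a
    shared linear layer, and the result is refined by Conv + upsampling.
    """
    def __init__(self, c: int = 32, k: int = 8):
        super().__init__()
        self.k = k
        self.mix = nn.Linear(c, c)
        self.conv = nn.Conv1d(c, c, kernel_size=3, padding=1)

    def forward(self, coords: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
        dist = torch.cdist(coords, coords)                 # (H, H) pairwise distances
        idx = dist.topk(self.k, largest=False).indices     # k nearest neighbours
        neigh = feats[idx]                                 # (H, k, C)
        agg = self.mix(neigh).mean(dim=1)                  # aggregate the neighbourhood
        z = agg.t().unsqueeze(0)                           # (1, C, H)
        z = self.conv(z)                                   # enlarge the receptive field
        z = F.interpolate(z, scale_factor=2, mode="linear")
        return z.squeeze(0).t()                            # (2H, C) global event feature

coords, feats = torch.rand(512, 3), torch.randn(512, 32)
print(EventGraphBlock()(coords, feats).shape)              # torch.Size([1024, 32])
```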
step 2.4, constructing the feature interaction module and fusing the event features with the image features:
the feature interaction module comprises x downsampling layers and y weight-sharing convolution layers;
the image feature f_i^img and the global event feature f_i^evt are input into the feature interaction module; after the x downsampling layers and the y weight-sharing convolution layers, the g-th group of common features f_{i,g}^fuse is obtained by formula (4) (available only as an image in the original publication), thereby obtaining the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i; in formula (4), <·,·> denotes the inner product, f_{i,g}^evt denotes the g-th group of the global event feature f_i^evt, and G denotes the number of groups; in this embodiment, x is 1, y is 3, and G is 4;
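The grouped fusion of step 2.4 can be sketched as follows: the image and event features are split into G = 4 groups along the channel dimension and each pair of groups is fused with an inner product, matching the <·,·> notation of formula (4). The downsampling/convolution stack, the names, and the exact form of the fusion are assumptions of this sketch.

```python
import torch
import torch.nn as nn

class FeatureInteraction(nn.Module):
    """Sketch of the feature interaction module: x = 1 downsampling layer,
    y = 3 weight-sharing conv layers, then a grouped inner-product fusion."""
    def __init__(self, c: int = 32, groups: int = 4):
        super().__init__()
        self.groups = groups
        self.down = nn.Conv2d(c, c, kernel_size=3, stride=2, padding=1)  # x = 1
        self.shared = nn.Conv2d(c, c, kernel_size=3, padding=1)          # y = 3 (reused)

    def _encode(self, f: torch.Tensor) -> torch.Tensor:
        f = self.down(f)
        for _ in range(3):                      # the same (weight-sharing) conv, 3 times
            f = torch.relu(self.shared(f))
        return f

    def forward(self, f_img: torch.Tensor, f_evt: torch.Tensor) -> torch.Tensor:
        a = self._encode(f_img).chunk(self.groups, dim=1)   # G groups of image feature
        b = self._encode(f_evt).chunk(self.groups, dim=1)   # G groups of event feature
        fused = [(ai * bi).sum(dim=1, keepdim=True) for ai, bi in zip(a, b)]  # <.,.>
        return torch.cat(fused, dim=1)          # (B, G, H/2, W/2) common features

f_img, f_evt = torch.randn(1, 32, 64, 64), torch.randn(1, 32, 64, 64)
print(FeatureInteraction()(f_img, f_evt).shape)   # torch.Size([1, 4, 32, 32])
```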
step 2.5, using the feature interaction module to output the final high-resolution image:
step 2.5.1, defining the iteration number as p and initializing p to 1; defining the maximum iteration number as P; taking the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i as the input data of the p-th iteration; in this embodiment, P is 8;
step 2.5.2, inputting the input data of the p-th iteration into the pixel attention module and the event graph module respectively for processing, so as to obtain the global event feature and the image feature of the p-th iteration;
step 2.5.3, processing the global event feature and the image feature of the p-th iteration with the feature interaction module, so as to obtain the input data of the (p+1)-th iteration;
step 2.5.4, after assigning p+1 to p, judging whether p > P holds; if so, inputting the input data of the (p+1)-th iteration into the decoding module for processing, so as to obtain the j-th frame high-resolution predicted image ŷ_j; otherwise, returning to step 2.5.2;
the specific structure of the feature interaction module is shown in fig. 5.
step 3, constructing the back-propagation loss function L_MSE by formula (5):
L_MSE = (1/R) · Σ_{r=1}^{R} ( ŷ_j^r - y_j^r )²   (5)
in formula (5), R is the number of pixels of the j-th frame high-resolution predicted image ŷ_j; ŷ_j^r is the r-th pixel of the high-resolution image generated by the neural network from the i-th low-resolution image x_i; y_j^r is the corresponding r-th pixel of the high-resolution image y_j in the high-resolution video image set Y;
step 4, training the video super-resolution neural network on the low-resolution image set X and its event sequences E, computing the loss function L_MSE and updating the network weights with learning rate lr_s; in this embodiment, the learning rate lr_s is 5e-5. Training is stopped when the number of training iterations reaches a set number or the loss error is smaller than a set threshold, so as to obtain the optimal super-resolution model; the optimal super-resolution network is then used to process a low-resolution video image and its corresponding event sequence, so as to obtain the corresponding high-resolution video image.
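A minimal training loop matching step 3 and step 4 (MSE loss, learning rate 5e-5, stopping on an iteration budget or a loss threshold) might look like the following; the optimizer (Adam), the model and the data loader are placeholders assumed for this sketch.

```python
import torch
import torch.nn as nn

def train(model, loader, max_iters: int = 100_000, loss_thresh: float = 1e-4):
    """Train with the back-propagated MSE loss of formula (5)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)   # lr_s = 5e-5
    criterion = nn.MSELoss()                                    # L_MSE
    it = 0
    while it < max_iters:
        for lr_frame, events, hr_frame in loader:
            pred = model(lr_frame, events)       # predicted HR frame
            loss = criterion(pred, hr_frame)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            it += 1
            if it >= max_iters or loss.item() < loss_thresh:
                return model
    return model

# Placeholder model and a single synthetic batch, just to show the call shape.
class Toy(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1),
                                nn.Upsample(scale_factor=2, mode="bilinear"))
    def forward(self, frame, events):
        return self.up(frame)

loader = [(torch.randn(2, 3, 32, 32), torch.randn(2, 3, 32, 32),
           torch.randn(2, 3, 64, 64))]
train(Toy(), loader, max_iters=1)
```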
Examples
In order to verify the effectiveness of the method, a common synthetic data set and a common real data set are selected for training and testing.
Since the proposed SGM network is trained end to end, a dataset containing LR intensity frames, corresponding event streams and HR intensity frames is required. In this embodiment, the network is trained on a synthetic dataset whose event data are generated with v2e. The high-frame-rate, high-resolution NFS and GoPro datasets are used as input sources, so high-resolution intensity images are readily available. To simulate real APS frames, the video frames are reduced to 128 × 128 in this embodiment and V2E is used to generate the LR event stream. The corresponding HR intensity frames are simply down-sampled according to the training scale factor (×2 or ×4). The synthetic dataset comprises 3828 data tuples generated from 132 video sequences. To improve the generalization of the network to real event data, the positive and negative contrast thresholds used during event generation are randomly sampled from a normal distribution with mean 0.15 and standard deviation 0.03. For testing, both a synthetic and a real dataset are used to evaluate the method. The synthetic test dataset consists of 841 intensity images and the corresponding simulated event streams between consecutive frames from 19 videos. For the real-world test dataset, the HQF dataset is selected; it consists of real events and low-resolution frames of different outdoor and indoor scenes captured by a DAVIS346 camera.
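The data preparation described above (downsampling frames to 128 × 128 for event simulation, downsampling according to the scale factor, and drawing the positive/negative contrast thresholds from N(0.15, 0.03)) can be sketched as follows. The event simulation itself is delegated to v2e, whose invocation is only indicated by a comment because its exact interface is not specified here; all function and variable names are assumptions.

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)

def prepare_pair(hr_frame: np.ndarray, scale: int = 2):
    """Build one (LR frame, simulation frame, thresholds) tuple for training.

    hr_frame: H x W x 3 uint8 ground-truth frame from NFS / GoPro.
    Returns the LR intensity frame (HR downsampled by `scale`), the 128 x 128
    frame that would be fed to the v2e event simulator, and the randomly drawn
    positive/negative contrast thresholds.
    """
    h, w = hr_frame.shape[:2]
    lr_frame = cv2.resize(hr_frame, (w // scale, h // scale), interpolation=cv2.INTER_AREA)
    sim_frame = cv2.resize(hr_frame, (128, 128), interpolation=cv2.INTER_AREA)
    pos_thres, neg_thres = rng.normal(0.15, 0.03, size=2)   # contrast thresholds
    # The 128 x 128 frame sequence would then be passed to v2e to synthesize
    # the LR event stream with these thresholds (v2e call not shown).
    return lr_frame, sim_frame, (float(pos_thres), float(neg_thres))

hr = (rng.random((256, 256, 3)) * 255).astype(np.uint8)
lr, sim, thres = prepare_pair(hr, scale=2)
print(lr.shape, sim.shape, thres)
```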
Peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and learned perceptual image patch similarity (LPIPS) are used as evaluation indexes.
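For reference, the three evaluation indexes can be computed with standard libraries as in the snippet below (scikit-image for PSNR/SSIM and the lpips package for LPIPS); this is a generic evaluation sketch, not code from the patent.

```python
import numpy as np
import torch
import lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

lpips_fn = lpips.LPIPS(net="alex")   # perceptual metric network (downloads weights once)

def evaluate(pred: np.ndarray, gt: np.ndarray):
    """pred, gt: H x W x 3 float images in [0, 1]; returns (PSNR, SSIM, LPIPS)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
    ssim = structural_similarity(gt, pred, channel_axis=2, data_range=1.0)
    to_t = lambda a: torch.from_numpy(a).permute(2, 0, 1)[None].float() * 2 - 1
    lp = lpips_fn(to_t(pred), to_t(gt)).item()   # LPIPS expects [-1, 1] tensors
    return psnr, ssim, lp

pred = np.random.rand(64, 64, 3).astype(np.float32)
gt = np.clip(pred + 0.01 * np.random.randn(64, 64, 3), 0, 1).astype(np.float32)
print(evaluate(pred, gt))
```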
In this embodiment, six methods are selected for comparison with the proposed method (SGM-Net): DPT, E2SRI, DCSR, EvIntSR, eSL-Net and SPADE.
The experimental results are shown in Tables 1 and 2:
TABLE 1: Experimental results for ×2 SR using the method of the present invention (SGM-Net) and the six selected comparative methods

Methods | DPT | E2SRI | DCSR | EvIntSR | eSL-Net | SPADE | SGM-Net
PSNR | 26.10 | 23.05 | 25.06 | 23.13 | 28.41 | 23.89 | 30.77
SSIM | 0.874 | 0.784 | 0.804 | 0.776 | 0.880 | 0.773 | 0.913
LPIPS | 0.107 | 0.192 | 0.121 | 0.151 | 0.092 | 0.139 | 0.063
TABLE 2: Experimental results for ×4 SR using the method of the present invention (SGM-Net) and the six selected comparative methods

Methods | DPT | E2SRI | DCSR | EvIntSR | eSL-Net | SPADE | SGM-Net
PSNR | 25.76 | 21.06 | 19.51 | 23.25 | 26.80 | 21.11 | 28.40
SSIM | 0.841 | 0.729 | 0.688 | 0.745 | 0.869 | 0.701 | 0.897
LPIPS | 0.088 | 0.192 | 0.229 | 0.149 | 0.099 | 0.191 | 0.082
As can be seen from Tables 1 and 2, for both ×2 and ×4 super-resolution on the same dataset the method of the present invention performs better than the other six methods, which demonstrates the feasibility of the proposed method. The experiments show that the proposed method can make full use of the prior information provided by event data to accomplish the video super-resolution task.

Claims (1)

1. An event-driven video super-resolution method based on stereogram modeling is characterized by comprising the following steps:
step 1, acquiring training video data and a corresponding event sequence, and segmenting the event sequence:
step 1.1, acquiring a real low-resolution video image set, storing it frame by frame and recording it as X = {x_1, x_2, ..., x_i, ..., x_N}, where x_i denotes the i-th frame low-resolution image, i = 1, 2, ..., N, and N is the number of low-resolution frames; acquiring the event sequence corresponding to the low-resolution video image set X and dividing it, according to the frame number N, into a corresponding number of event sequences recorded as E = {e_1, e_2, ..., e_i, ..., e_N}, where e_i denotes the event sequence corresponding to the i-th frame low-resolution image;
acquiring a high-resolution video image set, storing it frame by frame and recording it as Y = {y_1, y_2, ..., y_j, ..., y_N}, where y_j denotes the j-th frame sharp (high-resolution) image;
letting I = {X, E, Y} denote the training image dataset;
step 2, constructing a video super-resolution neural network, which comprises: a pixel attention module, a sampling module, an event graph module, a feature interaction module and a decoding module;
step 2.1, a pixel attention module is constructed and used for carrying out feature extraction on image data:
the pixel attention module consists of 5 convolution layers and 1 sigmoid function, where the convolution kernel size in each convolution layer is ks and the stride is s;
the i-th frame low-resolution image x_i is input into the pixel attention module and processed by the 1st convolution layer to obtain the key key_i and the value value_i of the i-th frame low-resolution image x_i; key_i is processed by the 2nd convolution layer to obtain a dictionary key; meanwhile, key_i is processed by the 3rd convolution layer and the 1st sigmoid layer to obtain a query key, and the dictionary key and the query key are multiplied to obtain the association matrix A_{i,key};
value_i is passed through 1 convolution layer to obtain the processed value value'_i, which is multiplied by the association matrix A_{i,key} to obtain the image feature f_i^img of the i-th low-resolution image x_i, where C denotes the number of channels of f_i^img;
step 2.2, inputting the event sequence e_i corresponding to the i-th frame low-resolution image into the sampling module for sampling:
step 2.2.1, after 3D voxelization of the i-th event sequence e_i, dividing the voxels evenly into H blocks and averaging each block to obtain H sampled event key points, the h-th of which is denoted p_{i,h};
Step 2.2.2 Key Point for h event
Figure FDA0003544972540000013
Is set as an event feature flati,hAt the h-th event key point
Figure FDA0003544972540000014
The region outside as its neighborhood
Figure FDA0003544972540000015
step 2.2.3, updating the event feature feat_{i,h} corresponding to the h-th event key point p_{i,h} by formula (1), so as to obtain the updated event feature feat'_{i,h} of the h-th event key point:
feat'_{i,h} = Up(Conv(feat_{i,h}))   (1)
in formula (1), Conv denotes a local convolution operation and Up denotes an upsampling operation;
step 2.3, the event graph module performs a stereo graph operation on the event features {feat'_{i,h}}_H of all event key points by formula (2) (available only as an image in the original publication), so as to obtain the global event feature f_i^evt of the i-th event sequence e_i; in formula (2), {feat'_{i,h}}_H denotes the event features corresponding to all event key points of the i-th event sequence e_i; Up denotes an upsampling operation and Conv denotes a convolution operation; feat'_{i,q} denotes the event feature of the q-th event key point; and a product function, whose definition (formula (3)) is likewise available only as an image, combines these key-point features;
step 2.4, constructing the feature interaction module and fusing the event features with the image features:
the feature interaction module comprises x downsampling layers and y weight-sharing convolution layers;
the image feature f_i^img and the global event feature f_i^evt are input into the feature interaction module; after the x downsampling layers and the y weight-sharing convolution layers, the g-th group of common features f_{i,g}^fuse is obtained by formula (4) (available only as an image in the original publication), thereby obtaining the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i; in formula (4), <·,·> denotes the inner product, f_{i,g}^evt denotes the g-th group of the global event feature f_i^evt, and G denotes the number of groups;
step 2.5, using the feature interaction module to output the final high-resolution image:
step 2.5.1, defining the iteration number as p and initializing p to 1; defining the maximum iteration number as P; taking the fused feature f_i^fuse of the i-th event sequence e_i and the i-th low-resolution image x_i as the input data of the p-th iteration;
step 2.5.2, inputting the input data of the p-th iteration into the pixel attention module and the event graph module respectively for processing, so as to obtain the global event feature and the image feature of the p-th iteration;
step 2.5.3, processing the global event feature and the image feature of the p-th iteration with the feature interaction module, so as to obtain the input data of the (p+1)-th iteration;
step 2.5.4, after assigning p+1 to p, judging whether p > P holds; if so, inputting the input data of the (p+1)-th iteration into the decoding module for processing, so as to obtain the j-th frame high-resolution predicted image ŷ_j; otherwise, returning to step 2.5.2;
step 3, constructing the back-propagation loss function L_MSE by formula (5):
L_MSE = (1/R) · Σ_{r=1}^{R} ( ŷ_j^r - y_j^r )²   (5)
in formula (5), R is the number of pixels of the j-th frame high-resolution predicted image ŷ_j; ŷ_j^r is the r-th pixel of the high-resolution image generated by the neural network from the i-th low-resolution image x_i; y_j^r is the corresponding r-th pixel of the high-resolution image y_j in the high-resolution video image set Y;
step 4, training the video super-resolution neural network on the low-resolution image set X and its event sequences E, computing the loss function L_MSE and updating the network weights with learning rate lr_s; training is stopped when the number of training iterations reaches a set number or the loss error is smaller than a set threshold, so as to obtain the optimal super-resolution model; the optimal super-resolution network is then used to process a low-resolution video image and its corresponding event sequence, so as to obtain the corresponding high-resolution video image.
CN202210245281.1A 2022-03-14 2022-03-14 Event-driven video super-resolution method based on stereogram modeling Active CN114612305B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210245281.1A CN114612305B (en) 2022-03-14 2022-03-14 Event-driven video super-resolution method based on stereogram modeling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210245281.1A CN114612305B (en) 2022-03-14 2022-03-14 Event-driven video super-resolution method based on stereogram modeling

Publications (2)

Publication Number Publication Date
CN114612305A (en) 2022-06-10
CN114612305B (en) 2024-04-02

Family

ID=81863410

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210245281.1A Active CN114612305B (en) 2022-03-14 2022-03-14 Event-driven video super-resolution method based on stereogram modeling

Country Status (1)

Country Link
CN (1) CN114612305B (en)



Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2711892A2 (en) * 2012-09-24 2014-03-26 Vision Semantics Limited Improvements in resolving video content
US20210209731A1 (en) * 2020-01-03 2021-07-08 Beijing Baidu Netcom Science And Technology Co., Ltd. Video processing method, apparatus, device and storage medium
CN111667442A (en) * 2020-05-21 2020-09-15 武汉大学 High-quality high-frame-rate image reconstruction method based on event camera
CN113610707A (en) * 2021-07-23 2021-11-05 广东工业大学 Video super-resolution method based on time attention and cyclic feedback network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Liu Na; Li Cuihua: "Single-frame image super-resolution reconstruction method based on multi-layer convolutional neural network learning", China Sciencepaper, No. 02
Liu Cun; Li Yuanxiang; Zhou Yongjun; Luo Jianhua: "Video image super-resolution reconstruction method based on convolutional neural network", Application Research of Computers, No. 04

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862122A (en) * 2022-12-27 2023-03-28 北京衔微医疗科技有限公司 Fundus image acquisition method, fundus image acquisition device, computer equipment and readable storage medium

Also Published As

Publication number Publication date
CN114612305B (en) 2024-04-02


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant