CN114612305A - Event-driven video super-resolution method based on stereogram modeling - Google Patents
- Publication number
- CN114612305A (application CN202210245281.1A)
- Authority
- CN
- China
- Prior art keywords
- event
- resolution
- image
- module
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4053—Super resolution, i.e. output image resolution higher than sensor resolution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G06F18/253—Fusion techniques of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T3/00—Geometric image transformation in the plane of the image
- G06T3/40—Scaling the whole image or part thereof
- G06T3/4046—Scaling the whole image or part thereof using neural networks
Abstract
The invention discloses an event-driven video super-resolution method based on stereogram modeling, which comprises the following steps: 1. acquiring video data and the corresponding event sequence, and segmenting the event sequence; 2. constructing a pixel attention module to extract features from the image; 3. re-sampling the neighborhood of each initial event through a sampling module and iteratively adjusting the sampled event features; 4. performing stereogram modeling on each sampled event and its neighborhood through an event-graph module, so as to gather local features within the neighborhood and gradually enlarge the receptive field over the whole event stream; 5. fusing the event features and the image features through the feature interaction module. The method makes full use of the prior information provided by event data to drive video super-resolution, thereby effectively improving the super-resolution effect.
Description
Technical Field
The invention relates to the field of video super-resolution, in particular to an event-driven video super-resolution method based on stereogram modeling.
Background
Video, as an important data source in computer visual communication, inevitably suffers quality degradation from various external factors, which affects both subjective quality and breadth of application. To improve image definition, video super-resolution has attracted much attention. Recently, a new type of sensor called an event camera has been proposed for recording and capturing scene intensity changes at microsecond precision. For an event camera, fast motion can be captured as events at a high temporal rate, which provides new opportunities for exploring solutions to video super-resolution.
An event camera is a new type of sensor that asynchronously records brightness changes as positive or negative polarity signals with microsecond precision, triggered by predefined contrast thresholds. Owing to its low latency, low cost and high temporal resolution, the event camera can serve many applications, such as video interpolation, super-resolution, deblurring, intensity-image reconstruction and event-stream denoising. However, most commercial event cameras produce a relatively low-resolution stream in order to remain efficient. Because high-resolution images and videos benefit computer vision tasks such as gesture recognition, target tracking and classification, more and more researchers have turned to event-guided super-resolution of intensity frames and have achieved academic success.
Generally, event-based super-resolution methods can be classified into three categories: (1) combining a network that translates events to intensity images with a super-resolution algorithm; (2) constructing an HR intensity image by directly super-resolving the event stream without intensity assistance; (3) taking a mixed signal (e.g., APS frames and events) as input to enhance the spatial resolution of the intensity image. However, these pipeline methods typically compress the event stream into event frames with the same channels and scale as the video frames. This strategy leaves the spatial correlation in the stereo event stream under-utilized and produces unrealistic results. In addition, these methods process frames and events with the same feature-extraction procedure, ignoring the important distinction between sparse event streams and dense video frames. As a result, the sparse features in events cannot be applied correctly, since the local receptive field and the shared weights of the convolutional layers distort most of the area. These problems limit further development of event-based video super-resolution research.
Disclosure of Invention
In order to overcome the defects of existing methods, the invention provides an event-driven video super-resolution method based on stereogram modeling, so that better super-resolution performance can be achieved in video super-resolution tasks across different scenes.
In order to solve the technical problems, the invention adopts the following technical scheme:
the event-driven video super-resolution method based on stereogram modeling is characterized by comprising the following steps of:
Step 1, acquire training video data and the corresponding event sequence, and segment the event sequence:
Step 1.1, acquire a real low-resolution video image set, store it in units of frames, and record it as X = {x_1, x_2, ..., x_i, ..., x_N}, where x_i denotes the i-th frame low-resolution image, i = 1, 2, ..., N, and N is the number of low-resolution frames; acquire the event sequence corresponding to the low-resolution video image set X, divide it into a corresponding number of event sequences according to the frame number N, and record them as E = {e_1, e_2, ..., e_i, ..., e_N}, where e_i denotes the event sequence corresponding to the i-th low-resolution frame;
Acquire a high-resolution video image set, store it in units of frames, and record it as Y = {y_1, y_2, ..., y_j, ..., y_N}, where y_j denotes the j-th frame sharp image;
Let I = {X, E, Y} denote the training image dataset;
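To make the data organization of step 1 concrete, the following sketch pairs a time-sorted event stream with the N video frames by timestamp, producing the per-frame event sequences e_1, ..., e_N. The (t, x, y, polarity) event layout and the frame-boundary timestamps are illustrative assumptions, not details given by the patent.

```python
import numpy as np

def split_events_by_frame(events, frame_times):
    """Split a time-sorted event stream into one sequence per video frame.

    events: (M, 4) array of (t, x, y, polarity) rows, sorted by t.
    frame_times: (N + 1,) array of frame boundaries; events in
    [frame_times[i], frame_times[i+1]) are assigned to frame i.
    Returns a list E = [e_1, ..., e_N] matching the patent's notation.
    """
    sequences = []
    for i in range(len(frame_times) - 1):
        mask = (events[:, 0] >= frame_times[i]) & (events[:, 0] < frame_times[i + 1])
        sequences.append(events[mask])
    return sequences
```

Each returned array can then be fed to the sampling module as one event sequence e_i.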
Step 2, construct a video super-resolution neural network comprising: a pixel attention module, a sampling module, an event-graph module, a feature interaction module and a decoding module;
step 2.1, a pixel attention module is constructed and used for carrying out feature extraction on image data:
The pixel attention module consists of 5 convolution layers and 1 sigmoid function, where the convolution kernel size in each convolution layer is ks and the stride is s;
Input the i-th frame low-resolution image x_i into the pixel attention module and process it with the 1st convolution layer to obtain the key key_i and the value value_i of the i-th frame. The key key_i is processed by the 2nd convolution layer to obtain the dictionary key; meanwhile, key_i is processed by the 3rd convolution layer and the sigmoid layer to obtain the query key. The product of the dictionary key and the query key yields the association matrix A_{i,key};
The value value_i passes through one further convolution layer to obtain the processed value value'_i, which is multiplied by the association matrix A_{i,key} to obtain the image feature of the i-th low-resolution image x_i, where C denotes the number of channels;
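As a rough illustration of the pixel-attention computation above, the sketch below models the convolution layers as per-pixel linear maps and the dictionary-key/query-key product as an element-wise association map; all weight names and shapes are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pixel_attention(x, w_in, w_key, w_query, w_value):
    """Simplified pixel-attention sketch over P flattened pixel positions.

    x: (P, C_in) image pixels. The 1x1-style convolution layers are modelled
    as per-pixel linear maps; the association matrix A weights the values.
    """
    feat = x @ w_in                      # 1st conv: shared key/value features
    dict_key = feat @ w_key              # 2nd conv: dictionary key
    query = sigmoid(feat @ w_query)      # 3rd conv + sigmoid: query key
    a = dict_key * query                 # product -> association matrix A_{i,key}
    value = feat @ w_value               # conv applied to the value branch
    return a * value                     # attended image feature
```

The element-wise product keeps the output the same shape as the input feature map, as expected of an attention re-weighting.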
Step 2.2, input the event sequence e_i corresponding to the i-th low-resolution frame into the sampling module for sampling:
Step 2.2.1, after 3D voxelization of the i-th event sequence e_i, divide it evenly into H blocks, then average each block to obtain H sampled event key points, the h-th of which is recorded as the h-th event key point;
Step 2.2.2, set the feature of the h-th event key point as the event feature feat_{i,h}, and take the region around the h-th event key point as its neighborhood;
Step 2.2.3, update the event feature feat_{i,h} of the h-th event key point with equation (1) to obtain the updated event feature feat'_{i,h}:
feat'_{i,h} = Up(Conv(feat_{i,h}))    (1)
In equation (1), Conv denotes a local convolution operation and Up denotes an upsampling operation;
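Step 2.2.1 can be sketched as follows. The voxel content is simplified to raw (t, x, y) coordinates, and dropping the remainder events that do not fill a block is an assumption for illustration.

```python
import numpy as np

def sample_event_keypoints(events, H):
    """Divide an event sequence evenly into H blocks and average each block
    to one key point, following step 2.2.1 (a sketch; the patent's exact
    voxel grid is not specified here).

    events: (M, 3) array of (t, x, y) coordinates, sorted by t; events that
    do not fill a complete block are dropped in this sketch.
    Returns an (H, 3) array of key-point coordinates.
    """
    m = (len(events) // H) * H
    blocks = events[:m].reshape(H, -1, events.shape[1])
    return blocks.mean(axis=1)           # one averaged key point per block
```

Each key point then carries an event feature feat_{i,h} that is refined by the Conv/Up update of equation (1).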
Step 2.3, the event-graph module performs a stereo-graph operation on the event features {feat'_{i,h}}_H of all event key points with equation (2), obtaining the global event feature of the i-th event sequence e_i;
In equation (2), {feat'_{i,h}}_H denotes the event features corresponding to all event key points of the i-th event sequence e_i; Up denotes an upsampling operation and Conv denotes a convolution operation; feat'_{i,q} denotes the event feature of the q-th event key point; the remaining operator in equation (2) denotes a product function;
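Equation (2) is not reproduced in the text, so the sketch below is only one plausible reading of the event-graph step — "gathering local features in the neighborhood" is interpreted as average-pooling each key point's feature over its k nearest neighbors in the event stereo volume; the choice of k and the distance metric are assumptions.

```python
import numpy as np

def aggregate_event_graph(keypoints, feats, k=3):
    """Hedged sketch of a neighborhood aggregation over event key points.

    keypoints: (H, 3) key-point coordinates in the event stereo volume.
    feats: (H, C) per-key-point features feat'_{i,h}.
    Returns (H, C) neighborhood-aggregated features.
    """
    # pairwise distances between key points in the stereo volume
    d = np.linalg.norm(keypoints[:, None, :] - keypoints[None, :, :], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]   # k nearest key points (incl. self)
    return feats[idx].mean(axis=1)       # average-pool the local neighborhood
```

Stacking such aggregations is one way the receptive field could "gradually increase" over the whole event stream, as the description states.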
Step 2.4, construct the feature interaction module and fuse the event features with the image features:
The feature interaction module comprises x downsampling layers and y weight-sharing convolution layers;
Input the image features and the global event features into the feature interaction module; after the x downsampling layers and y weight-sharing convolution layers, obtain the g-th group of common features with equation (3), thereby obtaining the common feature of the i-th event sequence e_i and the i-th low-resolution image x_i;
In equation (3), <·,·> denotes the inner product, the g-th group is taken from the G groups into which the global event features are divided, and G denotes the number of groups;
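Since equation (3) pairs feature groups through an inner product, the fusion can be sketched as below; treating the features as flat vectors split evenly along the channel axis is an illustrative simplification.

```python
import numpy as np

def fuse_features(img_feat, evt_feat, G):
    """Sketch of the feature-interaction fusion: split both features into G
    groups along the channel axis and take the inner product <.,.> of each
    pair of groups, one per g-th common feature.

    img_feat, evt_feat: (C,) feature vectors with C divisible by G.
    Returns a (G,) vector of per-group common features.
    """
    img_groups = img_feat.reshape(G, -1)
    evt_groups = evt_feat.reshape(G, -1)
    return np.sum(img_groups * evt_groups, axis=1)   # per-group inner product
```

The G scalar responses summarize how strongly each image-feature group agrees with the corresponding event-feature group.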
Step 2.5, iterate the above modules to output the final high-resolution image;
Step 2.5.1, define the iteration counter as p and initialize p = 1; define the maximum number of iterations as P; take the common feature of the i-th event sequence e_i and the i-th low-resolution image x_i as the input data of the p-th iteration;
Step 2.5.2, input the data of the p-th iteration into the pixel attention module and the event-graph module respectively, correspondingly obtaining the global event features and image features of the p-th iteration;
Step 2.5.3, process the global event features and image features of the p-th iteration with the feature interaction module to obtain the input data of the (p+1)-th iteration;
Step 2.5.4, assign p+1 to p, then judge whether p > P holds; if so, input the current data into the decoding module for processing, thereby obtaining the j-th frame high-resolution predicted image; otherwise, return to step 2.5.2;
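The control flow of steps 2.5.1-2.5.4 can be sketched as the loop below; the three step functions are placeholders standing in for the pixel attention, event-graph and feature interaction modules described in the text.

```python
def iterative_refine(state, attention_step, graph_step, interact_step, P):
    """Skeleton of the step-2.5 iteration: run the pixel-attention and
    event-graph branches on the fused state, fuse their outputs again, and
    repeat for P rounds before decoding.
    """
    for p in range(1, P + 1):
        img_feat = attention_step(state)           # p-th iteration image features
        evt_feat = graph_step(state)               # p-th iteration global event features
        state = interact_step(img_feat, evt_feat)  # input data for iteration p+1
    return state                                   # passed on to the decoding module
```

In the embodiment P is 8, so the two branches and the fusion are applied eight times before decoding.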
Step 3, construct the back-propagation loss function L_MSE with equation (4):
L_MSE = (1/R) * Σ_{r=1..R} (ŷ_j^r − y_j^r)²    (4)
In equation (4), R is the number of pixels of the j-th frame high-resolution predicted image; ŷ_j^r is the r-th pixel of the high-resolution image generated by the neural network from the i-th low-resolution image x_i; y_j^r is the corresponding r-th pixel of the j-th high-resolution image y_j in the high-resolution video image set Y;
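The loss of step 3 is a standard per-pixel mean squared error, which a few lines make explicit (the printed formula is garbled in the source, so this is the conventional reconstruction):

```python
import numpy as np

def mse_loss(pred, target):
    """Back-propagation loss of step 3: mean squared error over the R pixels
    of the predicted high-resolution frame against the ground-truth frame.
    """
    pred = np.asarray(pred, dtype=float)
    target = np.asarray(target, dtype=float)
    return np.mean((pred - target) ** 2)
```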
Step 4, train the video super-resolution neural network on the low-resolution image set X and its event sequences E, computing the loss function L_MSE and updating the network weights with learning rate lr_s; stop training when the number of training iterations reaches the set number or the loss error falls below the set threshold, thereby obtaining the optimal super-resolution model; process a low-resolution video image and its corresponding event sequence with this optimal super-resolution network to obtain the corresponding high-resolution video image.
Compared with the prior art, the invention has the beneficial effects that:
1. The method uses event data to drive the video super-resolution task and achieves a good end-to-end super-resolution effect with a small number of parameters, reducing the parameter count compared with prior methods while being more robust across datasets. Experimental results show that the proposed method outperforms state-of-the-art methods on both synthetic and real datasets.
2. The invention perceives the temporal correlation between adjacent event sequences through a stereo event-feature mechanism, which exploits the high temporal-resolution information provided by events to adjust the coordinates of the sampled events, capture the correlation of neighboring regions, and perceive long-term correlations over the whole event stream.
3. A temporal memory module computes the long-term correlation of different events so as to recover the correlation of temporal events. The final network is built from these two blocks and trained end to end: the similarity between query and key measures the temporal non-local correspondence to the current event, generating a corresponding value to perceive temporal change; a product operation yields a correlation matrix between the event at time T and the adjacent event sequences, which is used to fuse the event features; the temporal information between different event sequences is thus continuously recovered.
4. The invention uses a feature interaction module to fuse image features and event features. The additional information in events provides more detail for frame super-resolution, while the frame features iteratively and adaptively fine-tune the event features. By modeling the global relations across space and channels, the global information of the input features is deeply mined, which improves super-resolution performance and increases the interpretability of the model.
Drawings
FIG. 1 is a diagram of a super-resolution method of event-driven video based on stereogram modeling according to the present invention;
FIG. 2 is a block diagram of a sampling module according to the present invention;
FIG. 3 is a block diagram of an event graph module according to the present invention;
FIG. 4 is a block diagram of a feature interaction module of the present invention;
FIG. 5 is a flow chart of the inventive method.
Detailed Description
In this embodiment, the overall flow of the event-driven video super-resolution method based on stereogram modeling is shown in fig. 1. The method comprehensively considers the characteristics of event data and of the video sequence, and makes full use of the prior information provided by the event data to drive video super-resolution, so that the super-resolution effect can be effectively improved; the algorithm structure of the whole method is shown in fig. 2. Specifically, the method comprises the following steps:
Step 1.1, acquire a real low-resolution video image set, store it in units of frames, and record it as X = {x_1, x_2, ..., x_i, ..., x_N}, where x_i denotes the i-th frame low-resolution image, i = 1, 2, ..., N, and N is the number of low-resolution frames; acquire the event sequence corresponding to the low-resolution video image set X, divide it into a corresponding number of event sequences according to the frame number N, and record them as E = {e_1, e_2, ..., e_i, ..., e_N}, where e_i denotes the event sequence corresponding to the i-th low-resolution frame;
Acquire a high-resolution video image set, store it in units of frames, and record it as Y = {y_1, y_2, ..., y_j, ..., y_N}, where y_j denotes the j-th frame sharp image;
Let I = {X, E, Y} denote the training image dataset;
In this embodiment, the NFS dataset is used to train and evaluate the model; it contains 100 video sequences of different scenes, 80 of which are selected for training and the rest for evaluating the model;
Step 2, construct a video super-resolution neural network comprising: a pixel attention module, a sampling module, an event-graph module, a feature interaction module and a decoding module;
step 2.1, a pixel attention module is constructed and used for carrying out feature extraction on image data:
The pixel attention module consists of 5 convolution layers and 1 sigmoid function, where the convolution kernel size in each convolution layer is ks and the stride is s;
Input the i-th frame low-resolution image x_i into the pixel attention module and process it with the 1st convolution layer to obtain the key key_i and the value value_i of the i-th frame. The key key_i is processed by the 2nd convolution layer to obtain the dictionary key; meanwhile, key_i is processed by the 3rd convolution layer and the sigmoid layer to obtain the query key. The product of the dictionary key and the query key yields the association matrix A_{i,key};
The value value_i passes through one further convolution layer to obtain the processed value value'_i, which is multiplied by the association matrix A_{i,key} to obtain the image feature of the i-th low-resolution image x_i, where C denotes the number of channels;
In this embodiment, the convolution kernel size is 3 × 3, the stride is 1, and C = 32;
Step 2.2, input the event sequence e_i corresponding to the i-th low-resolution frame into the sampling module for sampling; the specific structure of the sampling module is shown in fig. 3:
Step 2.2.1, after 3D voxelization of the i-th event sequence e_i, divide it evenly into H blocks, then average each block to obtain H sampled event key points, the h-th of which is recorded as the h-th event key point; in this embodiment, H = 512;
Step 2.2.2, set the feature of the h-th event key point as the event feature feat_{i,h}, and take the region around the h-th event key point as its neighborhood;
Step 2.2.3, update the event feature feat_{i,h} of the h-th event key point with equation (1) to obtain the updated event feature feat'_{i,h}:
feat'_{i,h} = Up(Conv(feat_{i,h}))    (1)
In equation (1), Conv denotes a local convolution operation and Up denotes an upsampling operation;
Step 2.3, the event-graph module performs a stereo-graph operation on the event features {feat'_{i,h}}_H of all event key points with equation (2), obtaining the global event feature of the i-th event sequence e_i;
In equation (2), {feat'_{i,h}}_H denotes the event features corresponding to all event key points of the i-th event sequence e_i; Up denotes an upsampling operation and Conv denotes a convolution operation; feat'_{i,q} denotes the event feature of the q-th event key point; the remaining operator in equation (2) denotes a product function;
The event-graph module is shown in fig. 4;
Step 2.4, construct the feature interaction module and fuse the event features with the image features:
The feature interaction module comprises x downsampling layers and y weight-sharing convolution layers;
Input the image features and the global event features into the feature interaction module; after the x downsampling layers and y weight-sharing convolution layers, obtain the g-th group of common features with equation (3), thereby obtaining the common feature of the i-th event sequence e_i and the i-th low-resolution image x_i;
In equation (3), <·,·> denotes the inner product, the g-th group is taken from the G groups into which the global event features are divided, and G denotes the number of groups; in this embodiment, x = 1, y = 3, and G = 4;
Step 2.5, iterate the above modules to output the final high-resolution image;
Step 2.5.1, define the iteration counter as p and initialize p = 1; define the maximum number of iterations as P; take the common feature of the i-th event sequence e_i and the i-th low-resolution image x_i as the input data of the p-th iteration; in this embodiment, P is 8;
Step 2.5.2, input the data of the p-th iteration into the pixel attention module and the event-graph module respectively, correspondingly obtaining the global event features and image features of the p-th iteration;
Step 2.5.3, process the global event features and image features of the p-th iteration with the feature interaction module to obtain the input data of the (p+1)-th iteration;
Step 2.5.4, assign p+1 to p, then judge whether p > P holds; if so, input the current data into the decoding module for processing, thereby obtaining the j-th frame high-resolution predicted image; otherwise, return to step 2.5.2;
the specific structure of the feature interaction module is shown in fig. 5;
Step 3, construct the back-propagation loss function L_MSE with equation (4):
L_MSE = (1/R) * Σ_{r=1..R} (ŷ_j^r − y_j^r)²    (4)
In equation (4), R is the number of pixels of the j-th frame high-resolution predicted image; ŷ_j^r is the r-th pixel of the high-resolution image generated by the neural network from the i-th low-resolution image x_i; y_j^r is the corresponding r-th pixel of the j-th high-resolution image y_j in the high-resolution video image set Y;
Step 4, train the video super-resolution neural network on the low-resolution image set X and its event sequences E, computing the loss function L_MSE and updating the network weights with learning rate lr_s; in this embodiment, lr_s is set to 5e-5. Stop training when the number of training iterations reaches the set number or the loss error falls below the set threshold, thereby obtaining the optimal super-resolution model; process a low-resolution video image and its corresponding event sequence with this optimal super-resolution network to obtain the corresponding high-resolution video image.
Examples
To verify the effectiveness of the method, a common synthetic dataset and a common real dataset are selected for training and testing.
Since the proposed SGM network is trained end to end, a dataset containing LR intensity frames, the corresponding event streams and HR intensity frames is required. In this embodiment, the network is trained on a synthetic dataset whose event data are generated with v2e. The high-frame-rate, high-resolution NFS and GoPro datasets serve as input sources, so high-resolution intensity images are readily available. To simulate real APS frames, the video frames are reduced to 128 × 128 in this embodiment to generate the LR event stream with v2e, and the corresponding HR intensity frames are simply downsampled according to the training scale factor (x2 or x4). The synthetic dataset yields 3828 data tuples from 132 video sequences. To improve the network's generalization to real event data, the positive and negative contrast thresholds used when generating events are randomly sampled from a normal distribution with mean 0.15 and standard deviation 0.03. For testing, both a synthetic and a real dataset are used to evaluate the method: the synthetic test set consists of 841 intensity images and the corresponding simulated event streams between consecutive frames of 19 videos; for the real-world test set, the HQF dataset is selected, consisting of real events and low-resolution frames of different outdoor and indoor scenes captured by a DAVIS346 camera.
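The HR-to-LR simulation described above can be sketched as simple average-pool downsampling by the scale factor; the pooling choice is an assumption for illustration (the patent only says the HR frames are downsampled).

```python
import numpy as np

def make_lr_frame(hr_frame, scale):
    """Downsample an HR intensity frame by an integer scale factor (x2 or x4)
    using average pooling; edges that do not fill a full block are cropped.

    hr_frame: (H, W) grayscale frame. Returns an (H//scale, W//scale) frame.
    """
    h, w = hr_frame.shape[:2]
    h2, w2 = (h // scale) * scale, (w // scale) * scale
    crop = hr_frame[:h2, :w2]
    return crop.reshape(h2 // scale, scale, w2 // scale, scale).mean(axis=(1, 3))
```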
Peak signal-to-noise ratio (PSNR), structural similarity (SSIM) and learned perceptual image patch similarity (LPIPS) are used as evaluation metrics.
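Of the three metrics, PSNR is the simplest to state explicitly; the standard definition used for 8-bit images is:

```python
import numpy as np

def psnr(pred, target, peak=255.0):
    """Peak signal-to-noise ratio in dB (higher is better); `peak` is the
    maximum possible pixel value, 255 for 8-bit images."""
    mse = np.mean((np.asarray(pred, float) - np.asarray(target, float)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```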
In this embodiment, six methods are selected for comparison with the proposed method (SGM-Net): DPT, E2SRI, DCSR, EvIntSR, eSL-Net and SPADE.
The experimental results are shown in Tables 1 and 2:
Table 1. Results on SR x2 for the method of the present invention and the six selected comparative methods

| Methods | DPT | E2SRI | DCSR | EvIntSR | eSL-Net | SPADE | SGM-Net |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PSNR | 26.10 | 23.05 | 25.06 | 23.13 | 28.41 | 23.89 | 30.77 |
| SSIM | 0.874 | 0.784 | 0.804 | 0.776 | 0.880 | 0.773 | 0.913 |
| LPIPS | 0.107 | 0.192 | 0.121 | 0.151 | 0.092 | 0.139 | 0.063 |
Table 2. Results on SR x4 for the method of the present invention and the six selected comparative methods

| Methods | DPT | E2SRI | DCSR | EvIntSR | eSL-Net | SPADE | SGM-Net |
| --- | --- | --- | --- | --- | --- | --- | --- |
| PSNR | 25.76 | 21.06 | 19.51 | 23.25 | 26.80 | 21.11 | 28.40 |
| SSIM | 0.841 | 0.729 | 0.688 | 0.745 | 0.869 | 0.701 | 0.897 |
| LPIPS | 0.088 | 0.192 | 0.229 | 0.149 | 0.099 | 0.191 | 0.082 |
As can be seen from Tables 1 and 2, at both x2 and x4 super-resolution on the same datasets, the method of the present invention outperforms the other six methods, demonstrating its feasibility. The experiments show that the proposed method makes full use of the prior information provided by event data to complete the video super-resolution task.
Claims (1)
1. An event-driven video super-resolution method based on stereogram modeling is characterized by comprising the following steps:
step 1, acquiring training video data and a corresponding event sequence, and segmenting the event sequence:
Step 1.1, acquire a real low-resolution video image set, store it in units of frames, and record it as X = {x_1, x_2, ..., x_i, ..., x_N}, where x_i denotes the i-th frame low-resolution image, i = 1, 2, ..., N, and N is the number of low-resolution frames; acquire the event sequence corresponding to the low-resolution video image set X, divide it into a corresponding number of event sequences according to the frame number N, and record them as E = {e_1, e_2, ..., e_i, ..., e_N}, where e_i denotes the event sequence corresponding to the i-th low-resolution frame;
Acquire a high-resolution video image set, store it in units of frames, and record it as Y = {y_1, y_2, ..., y_j, ..., y_N}, where y_j denotes the j-th frame sharp image;
Let I = {X, E, Y} denote the training image dataset;
Step 2, construct a video super-resolution neural network comprising: a pixel attention module, a sampling module, an event-graph module, a feature interaction module and a decoding module;
step 2.1, a pixel attention module is constructed and used for carrying out feature extraction on image data:
The pixel attention module consists of 5 convolution layers and 1 sigmoid function, where the convolution kernel size in each convolution layer is ks and the stride is s;
Input the i-th frame low-resolution image x_i into the pixel attention module and process it with the 1st convolution layer to obtain the key key_i and the value value_i of the i-th frame. The key key_i is processed by the 2nd convolution layer to obtain the dictionary key; meanwhile, key_i is processed by the 3rd convolution layer and the sigmoid layer to obtain the query key. The product of the dictionary key and the query key yields the association matrix A_{i,key};
The value value_i passes through one further convolution layer to obtain the processed value value'_i, which is multiplied by the association matrix A_{i,key} to obtain the image feature of the i-th low-resolution image x_i, where C denotes the number of channels;
step 2.2, input the event sequence e_i corresponding to the ith frame of low-resolution image into the sampling module for sampling:
step 2.2.1, after 3D-voxelizing the ith event sequence e_i, evenly divide the voxelized data into H blocks and average each block, obtaining H sampled event key points; the hth one is recorded as the hth event key point;
step 2.2.2, set the feature at the hth event key point as the event feature feat_{i,h}, and take the region outside the hth event key point as its neighborhood;
step 2.2.3, update the event feature feat_{i,h} corresponding to the hth event key point using equation (1), obtaining the updated event feature feat'_{i,h} of the hth event key point:
feat'_{i,h} = Up(Conv(feat_{i,h})) (1)
In equation (1), Conv represents a local convolution operation and Up represents an upsampling operation;
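Steps 2.2.1-2.2.3 can be sketched as follows. The 3x3 box filter and nearest-neighbour upsampling are toy stand-ins for the learned Conv and Up of equation (1), and the time-axis block split is an assumption about how the voxelized data is divided.

```python
import numpy as np

def sample_keypoints(voxels, H):
    # step 2.2.1: evenly divide the 3-D voxelized events into H blocks and
    # average each block -> H event key points (split axis is assumed)
    return [b.mean(axis=0) for b in np.array_split(voxels, H, axis=0)]

def update_feature(feat, scale=2):
    # Eq. (1): feat' = Up(Conv(feat)); a 3x3 box filter stands in for the
    # local convolution, nearest-neighbour replication for the upsampling
    h, w = feat.shape
    pad = np.pad(feat, 1, mode='edge')
    conv = sum(pad[di:di + h, dj:dj + w]
               for di in range(3) for dj in range(3)) / 9.0
    return np.kron(conv, np.ones((scale, scale)))

keypoints = sample_keypoints(np.ones((8, 4, 4)), H=4)
updated = update_feature(keypoints[0])
```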
step 2.3, the event graph module performs the stereo-graph operation on the event features {feat'_{i,h}}_H using equation (2), obtaining the global event feature of the ith event sequence e_i;
in equation (2), {feat'_{i,h}}_H represents the event features corresponding to all event key points of the ith event sequence e_i; Up represents an upsampling operation and Conv represents a convolution operation; feat'_{i,q} represents the event feature of the qth event key point; the operator in equation (2) represents a product function;
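Since equation (2) is not reproduced in the text, the stereo-graph aggregation of step 2.3 is sketched here with a softmax-weighted affinity graph standing in for the exact product form: every key point exchanges information with the others before being pooled into one global event feature. All names and the weighting scheme are assumptions.

```python
import numpy as np

def global_event_feature(keypoint_feats):
    # stack the per-key-point features feat'_{i,h} into one matrix
    F = np.stack(keypoint_feats)                   # (H, C)
    sim = F @ F.T                                  # pairwise affinities = graph edges
    w = np.exp(sim - sim.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)              # row-normalized edge weights
    return (w @ F).mean(axis=0)                    # (C,) global event feature

g = global_event_feature([np.ones(3) * k for k in range(1, 5)])
```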
step 2.4, constructing the feature interaction module and fusing the event features and the image features:
the feature interaction module comprises x downsampling layers and y weight-sharing convolution layers;
the image feature and the global event feature are input into the feature interaction module; after the x downsampling layers and the y weight-sharing convolution layers, the gth group of common features is obtained using formula (3), thereby obtaining the common feature of the ith event sequence e_i and the ith low-resolution image x_i;
in formula (3), <.,.> represents the inner product, the subscript g indexes the gth group of the global event feature, and G represents the number of groups;
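The grouped inner-product fusion of formula (3) can be sketched as below; splitting both features into G groups along the channel axis is an assumption, as formula (3) itself is not reproduced in the text.

```python
import numpy as np

def grouped_fusion(img_feat, evt_feat, G):
    # split the image feature and global event feature into G groups
    # (assumed: along the channel axis) and fuse each pair with the
    # inner product <.,.> to obtain the G common features
    img_groups = np.array_split(img_feat, G)
    evt_groups = np.array_split(evt_feat, G)
    return np.array([np.inner(a, b) for a, b in zip(img_groups, evt_groups)])

common = grouped_fusion(np.ones(8), np.ones(8), G=4)
```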
step 2.5, output the final high-resolution image through iterative processing:
step 2.5.1, define the number of iterations as p and initialize p = 1; define the maximum number of iterations as P; take the common feature of the ith event sequence e_i and the ith low-resolution image x_i as the input data of the pth iteration;
step 2.5.2, input the input data of the pth iteration into the pixel attention module and the event graph module respectively for processing, correspondingly obtaining the global event feature and the image feature of the pth iteration;
step 2.5.3, after the global event feature and the image feature of the pth iteration are processed by the feature interaction module, the input data of the (p+1)th iteration is obtained;
step 2.5.4, after assigning p+1 to p, judge whether p > P; if so, input the current input data into the decoding module for processing, thereby obtaining the jth frame high-resolution predicted image; otherwise, return to step 2.5.2 and execute the steps in sequence;
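The iteration of steps 2.5.1-2.5.4 reduces to a plain loop. The callables below are placeholders showing only the control flow, not the patent's actual pixel attention, event graph, feature interaction, or decoding modules.

```python
def iterative_refine(init_data, P, attention, event_graph, interact, decode):
    # P rounds of (attention, event graph, interaction), then decode
    data = init_data
    for _ in range(P):
        img_feat = attention(data)           # pixel attention module
        evt_feat = event_graph(data)         # event graph module
        data = interact(img_feat, evt_feat)  # -> input of the next iteration
    return decode(data)                      # decoding module -> HR prediction

# toy placeholders to exercise the control flow only
y_hat = iterative_refine(1, P=2,
                         attention=lambda d: d + 1,
                         event_graph=lambda d: d * 2,
                         interact=lambda a, b: a + b,
                         decode=lambda d: d * 10)
```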
step 3, constructing a back propagation loss function L by using the formula (5)MSE:
In formula (5), R is the number of pixel points of the jth frame high-resolution predicted image, the first term is the rth pixel point of the high-resolution image generated by the neural network from the ith low-resolution image x_i, and the second term is the corresponding rth pixel point of the ith high-resolution image y_i in the high-resolution video image set Y;
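If equation (5) is the usual per-pixel mean squared error that the surrounding text describes, it can be sketched as follows; averaging over the R pixels is an assumption based on the name L_MSE.

```python
import numpy as np

def mse_loss(pred, target):
    # L_MSE over the R pixel points of the predicted and ground-truth
    # high-resolution frames (mean over R assumed from the name L_MSE)
    return np.mean((pred - target) ** 2)

loss = mse_loss(np.array([1.0, 2.0]), np.array([0.0, 0.0]))
```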
step 4, train the video super-resolution neural network on the low-resolution image set X and its event sequence E, computing the loss function L_MSE while updating the network weights at the learning rate lr_s; stop training when the number of training iterations reaches the set number or the loss error falls below the set threshold, thereby obtaining the optimal super-resolution model; process a low-resolution video image and its corresponding event sequence with the optimal super-resolution network to obtain the corresponding high-resolution video image.
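The stopping rule of step 4 (iteration budget or loss threshold, whichever comes first) can be sketched with a toy scalar objective standing in for L_MSE; all names here are illustrative.

```python
def train_until(w, grad_fn, loss_fn, lr, max_iters, tol):
    # halt when the iteration budget is exhausted or the loss error
    # falls below the set threshold (the patent's early-stop rule)
    for it in range(1, max_iters + 1):
        w = w - lr * grad_fn(w)       # weight update at learning rate lr_s
        if loss_fn(w) < tol:
            break
    return w, it

# toy objective (w - 3)^2 standing in for L_MSE
w_star, iters = train_until(10.0,
                            grad_fn=lambda w: 2.0 * (w - 3.0),
                            loss_fn=lambda w: (w - 3.0) ** 2,
                            lr=0.1, max_iters=1000, tol=1e-6)
```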
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210245281.1A CN114612305B (en) | 2022-03-14 | 2022-03-14 | Event-driven video super-resolution method based on stereogram modeling |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114612305A true CN114612305A (en) | 2022-06-10 |
CN114612305B CN114612305B (en) | 2024-04-02 |
Family
ID=81863410
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210245281.1A Active CN114612305B (en) | 2022-03-14 | 2022-03-14 | Event-driven video super-resolution method based on stereogram modeling |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114612305B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115862122A (en) * | 2022-12-27 | 2023-03-28 | 北京衔微医疗科技有限公司 | Fundus image acquisition method, fundus image acquisition device, computer equipment and readable storage medium |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2711892A2 (en) * | 2012-09-24 | 2014-03-26 | Vision Semantics Limited | Improvements in resolving video content |
CN111667442A (en) * | 2020-05-21 | 2020-09-15 | 武汉大学 | High-quality high-frame-rate image reconstruction method based on event camera |
US20210209731A1 (en) * | 2020-01-03 | 2021-07-08 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Video processing method, apparatus, device and storage medium |
CN113610707A (en) * | 2021-07-23 | 2021-11-05 | 广东工业大学 | Video super-resolution method based on time attention and cyclic feedback network |
Non-Patent Citations (2)
Title |
---|
Liu Na; Li Cuihua: "Single-frame image super-resolution reconstruction method based on multi-layer convolutional neural network learning", China Sciencepaper (中国科技论文), no. 02 *
Liu Cun; Li Yuanxiang; Zhou Yongjun; Luo Jianhua: "Video image super-resolution reconstruction method based on convolutional neural networks", Application Research of Computers (计算机应用研究), no. 04 *
Also Published As
Publication number | Publication date |
---|---|
CN114612305B (en) | 2024-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zhang et al. | Recursive neural network for video deblurring | |
Zhang et al. | Multi-level fusion and attention-guided CNN for image dehazing | |
US20220222776A1 (en) | Multi-Stage Multi-Reference Bootstrapping for Video Super-Resolution | |
Huang et al. | Self-filtering image dehazing with self-supporting module | |
CN113011329B (en) | Multi-scale feature pyramid network-based and dense crowd counting method | |
CN110580472B (en) | Video foreground detection method based on full convolution network and conditional countermeasure network | |
CN111723693B (en) | Crowd counting method based on small sample learning | |
CN114463218B (en) | Video deblurring method based on event data driving | |
CN110443761B (en) | Single image rain removing method based on multi-scale aggregation characteristics | |
Wang et al. | Video deblurring via spatiotemporal pyramid network and adversarial gradient prior | |
CN111445418A (en) | Image defogging method and device and computer equipment | |
CN113870335A (en) | Monocular depth estimation method based on multi-scale feature fusion | |
CN111369548A (en) | No-reference video quality evaluation method and device based on generation countermeasure network | |
CN105931189B (en) | Video super-resolution method and device based on improved super-resolution parameterized model | |
Zhang et al. | Unsupervised depth estimation from monocular videos with hybrid geometric-refined loss and contextual attention | |
Zhang et al. | Learning to restore light fields under low-light imaging | |
Zhou et al. | PADENet: An efficient and robust panoramic monocular depth estimation network for outdoor scenes | |
CN109871790B (en) | Video decoloring method based on hybrid neural network model | |
CN114612305B (en) | Event-driven video super-resolution method based on stereogram modeling | |
Tang et al. | Structure-embedded ghosting artifact suppression network for high dynamic range image reconstruction | |
Wan et al. | Progressive convolutional transformer for image restoration | |
Cui et al. | Multi-stream attentive generative adversarial network for dynamic scene deblurring | |
CN116403152A (en) | Crowd density estimation method based on spatial context learning network | |
CN112862723B (en) | Real image denoising method based on pseudo-3D autocorrelation network | |
Xue et al. | Bwin: A bilateral warping method for video frame interpolation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||