WO2023050723A1 - Video frame insertion method, apparatus, electronic device, storage medium, program and program product - Google Patents

Video frame insertion method, apparatus, electronic device, storage medium, program and program product

Info

Publication number
WO2023050723A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature map
frame
scale
event
initial
Prior art date
Application number
PCT/CN2022/079310
Other languages
English (en)
French (fr)
Inventor
于志洋
张宇
邹冬青
任思捷
Original Assignee
深圳市慧鲤科技有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市慧鲤科技有限公司
Publication of WO2023050723A1

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F18/00 Pattern recognition > G06F18/20 Analysing > G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F18/00 Pattern recognition > G06F18/20 Analysing > G06F18/25 Fusion techniques > G06F18/253 Fusion techniques of extracted features
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks > G06N3/04 Architecture, e.g. interconnection topology > G06N3/045 Combinations of networks
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks > G06N3/08 Learning methods

Definitions

  • the present disclosure relates to the field of computer technology, and in particular to a video frame insertion method, device, electronic equipment, storage medium, program and program product.
  • In related technologies, the frame rate of the original video may be increased by interpolating frames into the original video, for example by using a video frame insertion technology such as an optical flow estimation algorithm.
  • the image quality of the frame to be inserted generated by the relevant video frame insertion technology is not high, thereby reducing the picture quality of the video after frame insertion, for example, causing the picture of the video to shake and distort after frame insertion.
  • Embodiments of the present disclosure at least provide a video frame insertion method, device, electronic equipment, storage medium, program, and program product.
  • According to a first aspect, a video frame insertion method is provided, including: acquiring an initial frame to be inserted corresponding to a video to be processed, and first event information corresponding to the initial frame to be inserted, where the first event information is used to characterize the motion trajectory of an object in the initial frame to be inserted; performing feature extraction on the initial frame to be inserted and the first event information respectively to obtain an initial frame feature map corresponding to the initial frame to be inserted and an event feature map corresponding to the first event information; generating a target frame to be inserted according to the initial frame feature map and the event feature map; and inserting the target frame to be inserted into the video to be processed to obtain a processed video.
  • In this way, the picture quality of the processed video can be improved, and jitter and distortion of the picture in the processed video can be reduced.
  • a video frame insertion device including: an acquisition module configured to acquire an initial frame to be inserted corresponding to a video to be processed, and first event information corresponding to the initial frame to be inserted , the first event information is used to characterize the motion trajectory of the object in the initial frame to be inserted; the feature extraction module is configured to perform feature extraction on the initial frame to be inserted and the first event information respectively to obtain the An initial frame feature map corresponding to the initial frame to be inserted and an event feature map corresponding to the first event information; a generating module configured to generate a target frame to be inserted according to the initial frame feature map and the event feature map; frame insertion A module configured to insert the target frame to be inserted into the video to be processed to obtain a processed video.
  • an electronic device including: a processor; a memory configured to store instructions executable by the processor; wherein the processor is configured to call the instructions stored in the memory, to implement the method described in the first aspect.
  • a computer-readable storage medium on which computer program instructions are stored, and when the computer program instructions are executed by a processor, the method described in the first aspect is implemented.
  • According to another aspect, a computer program is provided, including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the method described in the first aspect.
  • According to another aspect, a computer program product is provided, including computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the method described in the first aspect.
  • In the embodiments of the present disclosure, the first event information representing the motion trajectory of the object in the initial frame to be inserted is used to optimize the initial frame to be inserted in the video to be processed, so that the image quality of the generated target frame to be inserted is higher than that of the initial frame to be inserted, thereby improving the picture quality of the processed video and helping to reduce jitter and distortion of the picture in the processed video.
  • Fig. 1 shows a flowchart of a video frame insertion method according to an embodiment of the present disclosure.
  • Fig. 2 shows a schematic diagram of a fusion feature map generation process according to an embodiment of the present disclosure.
  • Fig. 3 shows a schematic diagram of a raw frame event feature map according to an embodiment of the present disclosure.
  • Fig. 4 shows a schematic diagram of an image processing network implemented according to the present disclosure.
  • Fig. 5 shows a block diagram of a video frame insertion device according to an embodiment of the disclosure.
  • Fig. 6 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
  • Fig. 7 shows a block diagram of an electronic device according to an embodiment of the present disclosure.
  • Fig. 1 shows a flowchart of a video frame insertion method according to an embodiment of the present disclosure.
  • The video frame insertion method may be executed by an electronic device such as a terminal device or a server.
  • The terminal device may be a user equipment (User Equipment, UE), a mobile device, a user terminal, a terminal, a cellular phone, a cordless phone, a personal digital assistant (Personal Digital Assistant, PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, etc.
  • The method may be implemented by the processor invoking computer-readable instructions stored in the memory, or the method may be executed by a server.
  • the video frame insertion method includes:
  • In step S11, the initial frame to be inserted corresponding to the video to be processed and the first event information corresponding to the initial frame to be inserted are acquired, where the first event information is used to represent the motion trajectory of the object in the initial frame to be inserted.
  • the video to be processed may be understood as a low frame rate video to be inserted into a video frame.
  • In some implementations, optical flow estimation algorithms known in the art, such as the PWCNet algorithm or the FlowNet algorithm, can be used to calculate the optical flow from any two original video frames in the video to be processed to the frame insertion moment, and the original video frames can be rendered according to the optical flow by means of forward rendering (that is, forward mapping) to obtain the initial frame to be inserted.
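  • As a purely illustrative sketch (not the implementation of this disclosure), the following Python/PyTorch code forward-warps an original video frame to the frame insertion moment using a precomputed optical flow; the names forward_warp, I0 and flow_0t are hypothetical, and pixel collisions are resolved by simple averaging.

```python
import torch

def forward_warp(frame, flow):
    """Naive forward rendering (forward mapping): every source pixel is splatted
    to its flow-displaced target location. frame: (C, H, W); flow: (2, H, W)."""
    C, H, W = frame.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    tx = (xs + flow[0]).round().long().clamp(0, W - 1)   # target x of each source pixel
    ty = (ys + flow[1]).round().long().clamp(0, H - 1)   # target y of each source pixel
    warped = torch.zeros_like(frame)
    count = torch.zeros(H, W)
    flat = (ty * W + tx).view(-1)                        # flattened target indices
    warped.view(C, -1).index_add_(1, flat, frame.reshape(C, -1))
    count.view(-1).index_add_(0, flat, torch.ones(H * W))
    return warped / count.clamp(min=1)                   # average colliding pixels

# Hypothetical usage: initial frame to be inserted at the insertion moment,
# rendered from original frame I0 with the estimated flow flow_0t.
# initial_frame = forward_warp(I0, flow_0t)
```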
  • the first event information may be determined according to the event signal collected by the event camera.
  • The basic principle of an event camera can be simply understood as follows: when the cumulative brightness change at a certain collection point reaches a certain brightness threshold, an event signal is output, where the brightness threshold is an inherent parameter of the event camera, and the event signal can represent the brightness change at the corresponding collection point of the event camera.
  • When an object moves, the event camera generates a series of microsecond-level event signals, which can be output in the form of an event stream; based on this, according to the event stream collected by the event camera, event information representing the motion trajectory of the object at any microsecond-level moment can be obtained.
  • In some implementations, the event signals at the frame insertion moment corresponding to the initial frame to be inserted can be accumulated to obtain the first event information corresponding to the initial frame to be inserted; the first event information then represents the motion trajectory of the object at the frame insertion moment, and records the accumulated values of the event signals at the frame insertion moment in the form of a "graph", so that an event feature map can later be extracted from the first event information.
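  • As a hedged illustration of the accumulation described above, the sketch below sums polarity events within a small time window around the insertion moment into an image-like event frame; the structured-array layout (fields x, y, t, p) and the window size are assumptions and not details of this disclosure.

```python
import numpy as np

def accumulate_events(events, t_insert, half_window, height, width):
    """Accumulate event signals around the frame insertion moment into a
    "graph"-like representation: one channel per polarity, values are counts.
    events: structured array with fields x, y, t (seconds), p (+1/-1) -- assumed layout."""
    event_frame = np.zeros((2, height, width), dtype=np.float32)
    in_window = np.abs(events["t"] - t_insert) <= half_window
    for x, y, p in zip(events["x"][in_window], events["y"][in_window], events["p"][in_window]):
        event_frame[0 if p > 0 else 1, int(y), int(x)] += 1.0
    return event_frame
```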
  • In some implementations, the video to be processed can also be collected by the event camera, that is, the event camera can simultaneously collect the event signal and the video signal, with the event signal output in the form of an event stream and the video signal output in the form of a video stream.
  • the video to be processed can also be collected by other types of cameras (such as monocular cameras). Other types of cameras and event cameras can simultaneously collect signals for the same scene, which is not limited in this embodiment of the present disclosure.
  • In step S12, feature extraction is performed on the initial frame to be inserted and the first event information respectively, to obtain an initial frame feature map corresponding to the initial frame to be inserted and an event feature map corresponding to the first event information.
  • In some implementations, a feature extraction network known in the art, such as a Unet network or an AlexNet network, can be used to perform feature extraction on the initial frame to be inserted to obtain the initial frame feature map corresponding to the initial frame to be inserted, and to perform feature extraction on the first event information to obtain the event feature map corresponding to the first event information. It should be understood that the embodiment of the present disclosure does not limit the feature extraction network used.
  • In step S13, a target frame to be inserted is generated according to the initial frame feature map and the event feature map.
  • the initial frame feature map and event feature map extracted through step S12 may be multi-scale feature maps.
  • In some implementations, generating the target frame to be inserted according to the initial frame feature map and the event feature map may include: performing multi-scale feature fusion on the initial frame feature map and the event feature map through a multi-scale feature fusion network known in the art (such as a feature pyramid network) to obtain a fused feature map; and decoding the fused feature map through a decoding network to obtain the target frame to be inserted.
  • the decoding network corresponds to the network structure of the above-mentioned feature extraction network, and the above-mentioned feature extraction network can also be called an encoding network.
  • the target frame to be inserted generated in this way can integrate the feature information representing the motion trajectory of the object in the event feature map into the initial frame feature map, which can make the object displayed in the generated target frame to be inserted clearer and more stable. That is, the image quality of the target frame to be inserted is improved.
  • In step S14, the target frame to be inserted is inserted into the video to be processed to obtain a processed video.
  • inserting the target frame to be inserted into the video to be processed to obtain the processed video may include: inserting the target frame to be inserted into the video to be processed according to the frame insertion time corresponding to the initial frame to be inserted , the processed video is obtained, wherein the frame rate of the processed video is higher than that of the video to be processed, that is, the processed video can be understood as a high frame rate video. It should be understood that computer vision technology known in the art may be used to insert the target frame to be inserted into the video to be processed, which is not limited by the embodiments of the present disclosure.
  • In the embodiments of the present disclosure, the first event information representing the motion trajectory of the object in the initial frame to be inserted is used to optimize the initial frame to be inserted in the video to be processed, so that the image quality of the generated target frame to be inserted is higher than that of the initial frame to be inserted, thereby improving the picture quality of the processed video and helping to reduce jitter and distortion of the picture in the processed video.
  • In some implementations, step S13, generating the target frame to be inserted according to the initial frame feature map and the event feature map, includes:
  • Step S131 Generate an estimated frame to be inserted according to the initial frame feature map and the event feature map;
  • the initial frame feature map and the event feature map can be multi-scale.
  • For example, multi-scale feature fusion is performed on the initial frame feature map and the event feature map to obtain the fused feature map, and the fused feature map is then decoded through the decoding network to obtain the estimated frame to be inserted.
  • Step S132: according to the original video frame in the video to be processed that is adjacent to the frame insertion moment of the initial frame to be inserted, and the second event information corresponding to the original video frame, optimize the estimated frame to be inserted to obtain the target frame to be inserted, where the second event information is used to characterize the motion trajectory of the object in the original video frame.
  • The original video frames adjacent to the frame insertion moment of the initial frame to be inserted can be understood as the original video frames that are adjacent in time sequence to the frame insertion moment in the video to be processed.
  • The second event information corresponding to the original video frame can be obtained by referring to the way the first event information is determined in the above-mentioned embodiments of the present disclosure; that is, the event signals at the acquisition moment corresponding to the original video frame can be accumulated to obtain the second event information corresponding to the original video frame, so that the second event information represents the motion trajectory of the object at the acquisition moment corresponding to the original video frame.
  • the estimated frame to be inserted is optimized to obtain the target frame to be inserted.
  • For example, it may include: based on an attention mechanism, performing residual feature extraction on the combined information of the original video frame and the second event information by using a residual network to obtain a residual detail map, and performing image fusion on the residual detail map and the estimated frame to be inserted to obtain the target frame to be inserted.
  • In this way, the detailed information of the object in the original video frame can be extracted and fused into the estimated frame to be inserted, thereby enhancing the image quality of the estimated frame to be inserted, that is, making the target frame to be inserted have higher image quality.
  • the initial frame feature map and the event feature map can be multi-scale.
  • In some implementations, the initial frame feature map includes S scales and the event feature map includes S scales, where S is a positive integer and s ∈ [1, S).
  • an estimated frame to be inserted is generated according to the initial frame feature map and the event feature map, including:
  • Step S1311 According to the initial frame feature map of the 0th scale and the event feature map of the 0th scale, a fusion feature map of the 0th scale is obtained.
  • the initial frame feature map at the 0th scale and the event feature map at the 0th scale can be understood as the feature maps with the lowest scale or minimum size and minimum resolution in the initial frame feature map and event feature map, respectively.
  • In some implementations, obtaining the fusion feature map of the 0th scale according to the initial frame feature map of the 0th scale and the event feature map of the 0th scale may include: channel-splicing the initial frame feature map of the 0th scale with the event feature map of the 0th scale to obtain a spliced feature map; and filtering the spliced feature map to obtain the fusion feature map of the 0th scale. In this way, the fusion feature map of the 0th scale can be obtained conveniently and effectively.
  • channel splicing can be understood as splicing in the channel dimension of the feature map.
  • For example, two feature maps each with 128 channels and a size of 16×16 can be channel-spliced to obtain a feature map with 256 channels and a size of 16×16.
  • The spliced feature map can be filtered through a convolution layer with a 1×1 convolution kernel to obtain the fusion feature map of the 0th scale, where the number of convolution kernels in the convolution layer is the same as the number of channels of the initial frame feature map of the 0th scale.
  • the size and number of channels of the fusion feature map at the 0th scale are the same as the event feature map at the 0th scale and the initial frame feature map at the 0th scale.
  • For example, the spliced feature map is a 256-channel feature map of size 16×16; it is filtered by a convolution layer with 128 convolution kernels of size 1×1 to obtain a 128-channel, 16×16 fusion feature map of the 0th scale.
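  • A minimal sketch of this 0th-scale fusion (channel splicing followed by 1×1 filtering), assuming 128-channel, 16×16 feature maps as in the example above:

```python
import torch
import torch.nn as nn

frame_feat_0 = torch.randn(1, 128, 16, 16)   # initial frame feature map, 0th scale
event_feat_0 = torch.randn(1, 128, 16, 16)   # event feature map, 0th scale

# Channel splicing: concatenate along the channel dimension.
spliced = torch.cat([frame_feat_0, event_feat_0], dim=1)        # (1, 256, 16, 16)
# Filtering: 128 convolution kernels of size 1x1, matching the channel count
# of the 0th-scale initial frame feature map.
filter_1x1 = nn.Conv2d(in_channels=256, out_channels=128, kernel_size=1)
fused_0 = filter_1x1(spliced)                                   # (1, 128, 16, 16)
```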
  • Step S1312: according to the fused feature map of the (s-1)th scale, spatially align the initial frame feature map of the s-th scale with the event feature map of the s-th scale, and obtain the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale.
  • The initial frame feature map and the event feature map can be understood as expressions of the object from different perspectives, or in other words, their feature spaces are different; in order to facilitate their fusion, the initial frame feature map and the event feature map can be transformed into the same feature space, that is, spatially aligned.
  • According to the fused feature map of the (s-1)th scale, spatially aligning the initial frame feature map of the s-th scale with the event feature map of the s-th scale can be understood as converting the initial frame feature map and the event feature map into the feature space of the fused feature map of the (s-1)th scale.
  • The fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale obtained in this way can undergo feature fusion in the same feature space.
  • For example, Adaptive Instance Normalization can be used to align feature maps expressed from different perspectives into the same space, that is, to spatially align the initial frame feature map of the s-th scale with the event feature map of the s-th scale according to the fused feature map of the (s-1)th scale.
  • Step S1313 According to the fused feature map of the (s-1)th scale, the fused initial frame feature map of the sth scale, and the fused event feature map of the sth scale, a fused feature map of the sth scale is obtained.
  • In some implementations, obtaining the fusion feature map of the s-th scale according to the fused feature map of the (s-1)th scale, the fusionable initial frame feature map of the s-th scale, and the fusionable event feature map of the s-th scale may include: up-sampling the fused feature map of the (s-1)th scale to obtain an up-sampled feature map, where the size of the up-sampled feature map is the same as that of the initial frame feature map of the s-th scale and the event feature map of the s-th scale; and performing feature fusion on the up-sampled feature map, the fusionable initial frame feature map of the s-th scale, and the fusionable event feature map of the s-th scale to obtain the fusion feature map of the s-th scale.
  • Any feature fusion method known in the art can be used to fuse the above three feature maps, for example adding (add) the three feature maps while keeping the number of channels unchanged, or merging (concat) the three feature maps in the channel dimension and increasing the number of channels, which is not limited by this embodiment of the present disclosure.
  • Step S1312 to step S1313 can be understood as a recursive feature fusion process, in which the recursive fusion of the fusion feature maps of each scale other than the 0th scale can be expressed as formula (1): X_s = g(X_{s-1}; f_s, e_s), where X_{s-1} represents the fusion feature map of the (s-1)th scale, f_s represents the initial frame feature map of the s-th scale, e_s represents the event feature map of the s-th scale, and g(X_{s-1}; f_s, e_s) represents the spatial alignment and feature fusion process of steps S1312 to S1313.
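  • The recursion of formula (1) can be sketched as follows; initial_fusion (step S1311) and fuse_block (the alignment-and-fusion operation g() of steps S1312 to S1313) are assumed callables supplied by the caller, not APIs defined by this disclosure.

```python
def recursive_multiscale_fusion(frame_feats, event_feats, initial_fusion, fuse_block):
    """frame_feats / event_feats: lists of S feature maps ordered from scale 0
    (smallest) to scale S-1 (largest).
    initial_fusion(f0, e0): channel splicing + 1x1 filtering of step S1311 (assumed).
    fuse_block(X_prev, f_s, e_s): spatial alignment + feature fusion g() of
    steps S1312-S1313 (assumed)."""
    X = initial_fusion(frame_feats[0], event_feats[0])     # 0th-scale fusion feature map
    for f_s, e_s in zip(frame_feats[1:], event_feats[1:]):
        X = fuse_block(X, f_s, e_s)                        # X_s = g(X_{s-1}; f_s, e_s)
    return X                                               # (S-1)th-scale fused feature map
```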
  • Step S1314 Decoding the fused feature map of the (S-1)th scale to obtain an estimated frame to be inserted.
  • the fused feature map can be decoded by a decoding network to obtain an estimated frame to be inserted, wherein the decoding network corresponds to the network structure of the above-mentioned feature extraction network, and the above-mentioned feature extraction network can also be called an encoding network.
  • The fused feature map of the (S-1)th scale can be understood as the fused feature map obtained after the last feature fusion, that is, the above-mentioned fused feature map; based on this, the fused feature map of the (S-1)th scale can be decoded by the decoding network to obtain the estimated frame to be inserted.
  • In some implementations, the target frame to be inserted can be directly generated according to the initial frame feature map and the event feature map, that is, the estimated frame to be inserted can be directly used as the target frame to be inserted.
  • Since the image quality of the estimated frame to be inserted is already higher than that of the initial frame to be inserted, when the image quality of the estimated frame to be inserted already meets the user's image quality requirements, the estimated frame to be inserted can be directly used as the target frame to be inserted and inserted into the video to be processed. In this way, a clear and stable processed video can be quickly obtained.
  • the multi-scale adaptive feature fusion between the initial frame feature map and the event feature map can be effectively realized, so as to effectively obtain the estimated frame to be inserted.
  • In some implementations, step S1312, according to the fused feature map of the (s-1)th scale, spatially aligning the initial frame feature map of the s-th scale with the event feature map of the s-th scale to obtain the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale, includes:
  • up-sampling the fused feature map of the (s-1)th scale to obtain an up-sampled feature map, where the up-sampled feature map has the same size as the initial frame feature map of the s-th scale and the event feature map of the s-th scale;
  • obtaining the fusionable initial frame feature map of the s-th scale according to a first spatial transformation relationship, where the first spatial transformation relationship is determined according to first pixel size scaling information and first offset information of the initial frame feature map of the s-th scale during space conversion, and the feature information of the up-sampled feature map;
  • obtaining the fusionable event feature map of the s-th scale according to a second spatial transformation relationship, where the second spatial transformation relationship is determined according to second pixel size scaling information and second offset information of the event feature map of the s-th scale during space conversion, and the feature information of the up-sampled feature map;
  • wherein the fusionable initial frame feature map of the s-th scale, the fusionable event feature map of the s-th scale and the up-sampled feature map are in the same feature space; the pixel size scaling information represents the scaling ratio of each pixel during the space transformation, and the offset information represents the position offset of each pixel during the space transformation.
  • The first spatial transformation relationship can be expressed as formula (2-1), and the second spatial transformation relationship can be expressed as formula (2-2).
  • The pixel size can be understood as a pixel-level size, or in other words, the size occupied by each pixel in the feature map, where the size scaling ratio includes a size enlargement ratio or a size reduction ratio. It should be understood that, during space conversion, the pixel size of each pixel may increase (or be enhanced) or shrink (or be weakened), and the position of each pixel may be shifted; based on this, according to the pixel size scaling and position offset, feature maps in different feature spaces can be spatially aligned, that is, transformed into the same feature space.
  • In this way, the first spatial transformation relationship and the second spatial transformation relationship can be effectively used to spatially align the initial frame feature map of the s-th scale with the event feature map of the s-th scale, so as to obtain the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale that can be used for feature fusion.
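  • Because formulas (2-1) and (2-2) are not reproduced in this text, the following sketch only assumes an AdaIN/SFT-style modulation consistent with the description: per-pixel scaling (c) and offset (b) predicted from the s-th scale feature map are applied to the up-sampled, instance-normalized fused feature map. The class name SpatialAlign and the 3×3 prediction convolutions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAlign(nn.Module):
    """Hedged sketch of the spatial alignment around formulas (2-1)/(2-2)."""
    def __init__(self, channels):
        super().__init__()
        self.to_scale = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.to_offset = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, fused_prev, feat_s):
        # Up-sample the (s-1)-th scale fused feature map to the s-th scale size.
        up = F.interpolate(fused_prev, size=feat_s.shape[-2:], mode="bilinear",
                           align_corners=False)
        up = F.instance_norm(up)                 # normalize before modulation
        c = self.to_scale(feat_s)                # pixel size scaling information
        b = self.to_offset(feat_s)               # offset information
        return c * up + b                        # fusionable feature map, same space as `up`

# Two independent instances would be used: one predicting (c_f, b_f) from the
# initial frame feature map f_s and one predicting (c_e, b_e) from the event
# feature map e_s.
```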
  • It should be noted that the event signal has a good perception ability for the boundary of a moving object, because such motion often causes brightness changes at the collected points on the object, whereas the motion estimation of an optical flow algorithm based purely on the video signal is often unreliable in such regions; on the other hand, for static areas with simple texture, the perception ability of the event camera is weakened, and the reliability of the captured event information may be lower than that of the video information extracted from the video signal. In other words, the event information and the video information are complementary.
  • In some implementations, step S1313, obtaining the fusion feature map of the s-th scale according to the fused feature map of the (s-1)th scale, the fusionable initial frame feature map of the s-th scale, and the fusionable event feature map of the s-th scale, includes:
  • Step S13131: perform convolution processing and nonlinear processing on the up-sampled feature map to obtain a mask map corresponding to the up-sampled feature map, where the up-sampled feature map is obtained by up-sampling the fusion feature map of the (s-1)th scale;
  • convolution processing and nonlinear processing may be performed on the upsampling feature map through a convolution layer and an activation function (such as sigmoid) layer to obtain a mask map corresponding to the upsampling feature map.
  • the mask map can represent whether each pixel in the upsampled feature map is a pixel on a moving object. It should be understood that the embodiment of the present disclosure does not limit the size and number of convolution kernels in the above convolution layer and the activation function type used in the activation function layer.
  • The mask map can be recorded in the form of a binary mask (that is, 0 and 1); for example, "0" can be used to represent a pixel on a moving object and "1" can be used to represent a pixel that is not on a moving object, which is not limited in this embodiment of the present disclosure.
  • Step S13132 According to the mask map, perform feature fusion on the fusionable initial frame feature map of the sth scale and the fusionable event feature map of the sth scale to obtain the fusion feature map of the sth scale.
  • In some implementations, formula (3) can be used to perform feature fusion on the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale according to the mask map, to obtain the fusion feature map of the s-th scale: y = m ⊙ y_e + (1 - m) ⊙ y_f, where m represents the mask map, 1 - m represents the reverse mask map, y_e represents the fusionable event feature map of the s-th scale, y_f represents the fusionable initial frame feature map of the s-th scale, and y represents the fusion feature map of the s-th scale.
  • The mask map m can be recorded as a binary mask, and the reverse mask map can be expressed as 1 - m.
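  • A minimal sketch of the mask-guided fusion of formula (3), assuming the mask is produced by a 1×1 convolution plus sigmoid applied to the up-sampled feature map (mask_conv is a caller-supplied, assumed module):

```python
import torch

def mask_guided_fusion(upsampled, y_f, y_e, mask_conv):
    """y = m * y_e + (1 - m) * y_f, with element-wise (Hadamard) products.
    upsampled: up-sampled (s-1)-th scale fusion feature map.
    y_f / y_e: fusionable initial frame / event feature maps of the s-th scale.
    mask_conv: assumed nn.Conv2d(C, 1, kernel_size=1) producing the mask logits."""
    m = torch.sigmoid(mask_conv(upsampled))      # mask map, values in (0, 1)
    return m * y_e + (1.0 - m) * y_f             # initial fusion feature map of the s-th scale

# Per the description below, the result may then be refined by a 3x3 convolution
# followed by a LeakyReLU to obtain the s-th scale fusion feature map.
```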
  • FIG. 2 shows a schematic diagram of a fusion feature map generation process according to an embodiment of the present disclosure.
  • As shown in FIG. 2, the fusion feature map X_{s-1} of the (s-1)th scale is up-sampled and instance-normalized to obtain the up-sampled feature map.
  • The up-sampled feature map is input to a convolution layer with a 1×1 convolution kernel (1×1 Conv) and an activation function (such as sigmoid) layer to obtain the mask map (m) and the reverse mask map (1-m); for the initial frame feature map f_s and the event feature map e_s, two sets of independent convolution layers can be used to learn the corresponding c_f, b_f and c_e, b_e during the space transformation, and the fusion feature map X_s of the s-th scale is obtained using the above formulas (2-1), (2-2) and (3).
  • In this way, under the guidance of the mask map corresponding to the up-sampled feature map, feature fusion can be performed adaptively on the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale.
  • In some implementations, step S13132, performing feature fusion on the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale according to the mask map to obtain the fusion feature map of the s-th scale, includes:
  • performing feature fusion on the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale according to the mask map to obtain an initial fusion feature map of the s-th scale; and performing convolution processing and nonlinear processing on the initial fusion feature map of the s-th scale to obtain the fusion feature map of the s-th scale.
  • the feature fusion of the fusionable initial frame feature map of the sth scale and the fusionable event feature map of the sth scale can be performed adaptively.
  • In some implementations, the implementation shown in formula (3) can be referred to, so as to perform feature fusion on the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale according to the mask map and obtain the initial fusion feature map of the s-th scale; that is, y in the above formula (3) can also represent the initial fusion feature map of the s-th scale.
  • Fusing the fusionable initial frame feature map of the s-th scale with the fusionable event feature map of the s-th scale according to the mask map to obtain the initial fusion feature map of the s-th scale may include: calculating the Hadamard product between the mask map and the fusionable event feature map of the s-th scale; calculating, according to the reverse mask map corresponding to the mask map, the product between the reverse mask map and the fusionable initial frame feature map of the s-th scale; and adding the Hadamard product to the product to obtain the initial fusion feature map of the s-th scale.
  • the non-linearity of the fusion feature map can be effectively increased or the complexity of the fusion feature map can be increased, which facilitates the realization of multi-scale feature fusion.
  • the s-th scale fused initial frame feature map and the s-th scale fused event feature map can be adaptively fused.
  • In some implementations, the initial fusion feature map of the s-th scale can be processed by a convolution layer with a 3×3 convolution kernel and an activation function layer (such as LeakyReLU) to perform convolution processing and nonlinear processing, so as to obtain the fusion feature map of the s-th scale.
  • It should be understood that the embodiment of the present disclosure does not limit the size and number of convolution kernels in the above convolution layer or the activation function type used in the activation function layer.
  • the nonlinearity of the fused feature map can be effectively increased or the complexity of the fused feature map can be increased, so as to facilitate the realization of multi-scale feature fusion.
  • the image details of the object in the original video frame can be combined with the motion track of the object in the original video frame to fuse the detailed information of the object into the predicted frame to be inserted, thereby enhancing the image quality of the predicted frame to be inserted.
  • In some implementations, optimizing the estimated frame to be inserted to obtain the target frame to be inserted includes:
  • Step S1321 Combine the estimated frame to be inserted with the first event information to obtain estimated frame event combination information.
  • As described above, the first event information can represent the motion trajectory of the object at the frame insertion moment corresponding to the initial frame to be inserted, and the estimated frame to be inserted is generated based on the initial frame feature map of the initial frame to be inserted and the event feature map of the first event information.
  • the first event information may record the cumulative value of the event signal at the frame insertion time corresponding to the initial frame to be inserted in the form of a "graph".
  • the predicted frame event combination information includes the predicted frame to be inserted and the first event information.
  • Step S1322 Combine the original video frame with the second event information to obtain original frame event combination information.
  • the second event information can represent the motion trajectory of the object at the acquisition moment corresponding to the original video frame, and the second event information can record the accumulated value of the event signal at the acquisition moment corresponding to the original video frame in the form of a "graph".
  • The original frame event combination information includes the original video frame and the second event information.
  • Step S1323 Perform feature extraction on the estimated frame event combination information and the original frame event combination information respectively, and obtain the estimated frame event feature map corresponding to the estimated frame event combination information and the original frame event feature map corresponding to the original frame event combination information.
  • In some implementations, a multi-layer convolution layer with parameter sharing can be used to perform feature extraction on the estimated frame event combination information and the original frame event combination information, so as to obtain the estimated frame event feature map corresponding to the estimated frame event combination information and the original frame event feature map corresponding to the original frame event combination information.
  • the estimated frame event combination information can be input into the 3-layer convolutional layer, and the estimated frame event feature map can be output; the original frame event combination information can be input into the 3-layer convolutional layer, and the original frame can be output Event feature map.
  • the original video frame may be at least one frame, and the original frame event combination information may be at least one, then the original frame event feature map may be at least one. It should be understood that, feature extraction methods known in the art may be used to extract the above-mentioned estimated frame event feature map and original frame event feature map, which is not limited in this embodiment of the present disclosure.
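  • A hedged sketch of the parameter-shared feature extraction described above; the three-layer encoder below uses assumed channel counts (an RGB frame plus a 2-channel event representation), strides and activations that are not specified by this disclosure.

```python
import torch
import torch.nn as nn

# Dummy inputs with assumed shapes.
est_frame = torch.randn(1, 3, 256, 256)          # estimated frame to be inserted (RGB)
first_event_info = torch.randn(1, 2, 256, 256)   # first event information
orig_frame = torch.randn(1, 3, 256, 256)         # original video frame
second_event_info = torch.randn(1, 2, 256, 256)  # second event information

encoder = nn.Sequential(                          # parameters shared by both inputs
    nn.Conv2d(5, 32, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.1),
    nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.1),
    nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1), nn.LeakyReLU(0.1),
)

est_combo = torch.cat([est_frame, first_event_info], dim=1)      # estimated frame event combination
orig_combo = torch.cat([orig_frame, second_event_info], dim=1)   # original frame event combination
est_feat = encoder(est_combo)     # estimated frame event feature map
orig_feat = encoder(orig_combo)   # original frame event feature map (same weights)
```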
  • Step S1324 According to the estimated frame event feature map, the original frame event feature map is adjusted to obtain an integrated feature map.
  • In some implementations, an attention mechanism can be used to find, in the original frame event feature map, the matching pixel corresponding to each pixel in the estimated frame event feature map, or in other words, the pixel in the original frame event feature map with the greatest similarity to each pixel in the estimated frame event feature map; then, taking the pixel position of each matching pixel in the original frame event feature map as the center, multiple feature blocks of a specified size are cropped from the original frame event feature map, and the multiple feature blocks of the specified size are size-spliced according to the pixel positions of the matching pixels to obtain the integrated feature map.
  • Size splicing can be understood as splicing in the length and width dimensions of the feature map, so that the size of the integrated feature map is the same as the size of the original frame event feature map.
  • For example, four feature blocks of size 2×2 are spliced in the size dimension to obtain an integrated feature map of size 4×4.
  • Step S1325 According to the integrated feature map, the estimated frame event feature map and the fusion feature map, optimize the estimated frame to be inserted to obtain the target frame to be inserted.
  • The fusion feature map is obtained by multi-scale fusion of the initial frame feature map and the event feature map.
  • the fused feature map may be obtained by performing multi-scale fusion of the initial frame feature map and the event feature map through steps S1311 to S1313 in the above-mentioned embodiments of the present disclosure, and the process of determining the fused feature map will not be repeated here.
  • the fused feature map can be multi-scale, and the integrated feature map can also be multi-scale.
  • multi-layer convolutional layers can be used to perform feature extraction on the estimated frame event combination information and the original frame event combination information, then the estimated frame event feature map and the original frame event feature map can be multi-scale feature maps , based on which the integrated feature maps can be multi-scale.
  • In some implementations, optimizing the estimated frame to be inserted to obtain the target frame to be inserted may include: performing multi-scale fusion on the integrated feature map, the estimated frame event feature map and the fusion feature map to obtain a target fusion feature map; extracting residual features from the target fusion feature map through a residual network, and decoding the residual features through a designated decoding network to obtain residual information corresponding to the residual features; and superimposing the residual information on the estimated frame to be inserted to obtain the target frame to be inserted.
  • Multi-scale fusion of the integrated feature map, the estimated frame event feature map and the fusion feature map can be realized to obtain the target fusion feature map, which will not be repeated here.
  • the network structure of the specified decoding network may correspond to the multi-layer convolutional layer used to extract the original frame event feature map and the estimated frame event feature map, that is, the above-mentioned multi-layer convolutional layer can be understood as an encoding network.
  • The residual information can also be in the form of a "graph", and superimposing the residual information on the estimated frame to be inserted can be understood as performing image fusion of the residual information and the estimated frame to be inserted.
  • In this way, the integrated feature map, the estimated frame event feature map and the fusion feature map can be fused, the residual information representing image details in the target fusion feature map can be extracted, and the target frame to be inserted obtained by superimposing the residual information on the estimated frame to be inserted has higher image quality.
  • multi-layer convolutional layers can be used to perform feature extraction on the estimated frame event combination information and the original frame event combination information, then the estimated frame event feature map and the original frame event feature map can be multi-scale feature maps .
  • In some implementations, the estimated frame event feature map includes S* scales and the original frame event feature map includes S* scales, where 1 ≤ S* ≤ S, S* is a positive integer, s* ∈ [S-S*, S), and the size of the estimated frame event feature map of the (S-S*)th scale is I×I, I being a positive integer. In this case, step S1324, adjusting the original frame event feature map according to the estimated frame event feature map to obtain the integrated feature map, includes:
  • Step S13241: for any first pixel in the estimated frame event feature map of the (S-S*)th scale, determine a first matching pixel that matches the first pixel from the original frame event feature map of the (S-S*)th scale.
  • The first matching pixel that matches the first pixel may be understood as the pixel having the largest feature similarity with the first pixel.
  • In some implementations, determining the first matching pixel includes: for any first pixel, calculating the feature similarity between the first pixel and each pixel in a specified window in the original frame event feature map of the (S-S*)th scale, where the specified window is determined based on the pixel position of the first pixel; and determining, among the pixels in the specified window, the pixel corresponding to the maximum feature similarity as the first matching pixel. In this manner, the first matching pixel that matches each first pixel can be efficiently determined.
  • The specified window may be, for example, a local window of (2m+1)^2 size centered on the pixel position of each first pixel, where m can be set according to actual needs (for example, 3), which is not limited by this embodiment of the present disclosure. In this manner, the range of searching for the first matching pixel in the original frame event feature map can be narrowed, the amount of computation can be reduced, and the efficiency of determining the first matching pixel can be improved.
  • The feature similarity may be measured, for example, by the Euclidean distance or the cosine distance between feature values; in that case, the first matching pixel is the pixel with the smallest Euclidean distance or cosine distance among the pixels in the specified window.
  • For example, formula (4) shows an implementation of determining the feature similarity based on the Euclidean distance according to an embodiment of the present disclosure, where i represents the pixel position of any first pixel in the estimated frame event feature map of the (S-S*)th scale, p represents a given integer offset within the specified window, i+p represents the pixel position of each pixel in the specified window of the original frame event feature map, k_0(i+p) represents the feature value of each pixel in the specified window of the original frame event feature map, ||·||_2 denotes the 2-norm, and D(i, p) represents the Euclidean distance between the feature value of the first pixel and the feature value k_0(i+p) of each pixel in the specified window.
  • The distances {D(i, p), p ∈ [-m, m]^2} can be organized as the distances between the "query" vector of the first pixel and the (2m+1)^2 "key" vectors, where j = i + p* is the pixel position at which the minimum distance is attained, and k_0(j) can be understood as the first matching pixel of the first pixel.
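  • In the spirit of formula (4), the sketch below computes, for every first pixel, the Euclidean distances to all pixels of the (2m+1)^2 specified window in the original frame event feature map and returns the offset p* of the minimum; replicate padding at the borders and the tensor layout are assumptions, and the unfolded tensor can be memory-hungry for large feature maps.

```python
import torch
import torch.nn.functional as F

def match_in_local_window(query_feat, key_feat, m=3):
    """query_feat: (C, I, I) estimated frame event feature map (smallest scale).
    key_feat:   (C, I, I) original frame event feature map at the same scale.
    Returns per-pixel integer offsets (dy, dx) in [-m, m] that minimize D(i, p)."""
    C, H, W = query_feat.shape
    padded = F.pad(key_feat.unsqueeze(0), (m, m, m, m), mode="replicate")
    # All (2m+1)^2 shifted "key" patches for every location, shape (C, K, H*W).
    patches = F.unfold(padded, kernel_size=2 * m + 1).view(C, (2 * m + 1) ** 2, H * W)
    diffs = patches - query_feat.reshape(C, 1, H * W)        # query vs. each key
    dists = diffs.pow(2).sum(dim=0).sqrt()                   # D(i, p), shape (K, H*W)
    best = dists.argmin(dim=0)                               # index of p* per pixel
    dy = torch.div(best, 2 * m + 1, rounding_mode="floor") - m
    dx = best % (2 * m + 1) - m
    return dy.view(H, W), dx.view(H, W)
```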
  • Step S13242 Determine the sub-pixel position corresponding to the pixel position according to the pixel position of the first matching pixel point and the specified offset, and the specified offset is a decimal.
  • In some implementations, a local distance field can be constructed with the pixel position j of the first matching pixel as the center, and the local distance field can be continuously fitted by a parameterized second-order polynomial whose global minimum has a closed-form solution.
  • By adjusting the shape of the local distance field, that is, by adjusting the parameters of the second-order polynomial, an estimate of the specified offset can be obtained.
  • the embodiment of the present disclosure will elaborate on the manner of determining the specified offset below.
  • Determining the sub-pixel position corresponding to the pixel position may include: adding the pixel position and the specified offset to obtain the sub-pixel position, where, since the specified offset is fractional, a more accurate sub-pixel position at a non-integer coordinate can be obtained.
  • Step S13243: according to the I×I sub-pixel positions, adjust the original frame event feature map of the s*-th scale to obtain the integrated feature map of the s*-th scale.
  • The size of the estimated frame event feature map of the (S-S*)th scale is I×I, that is, there are I×I first pixels in the estimated frame event feature map of the (S-S*)th scale; for each first pixel, a sub-pixel position can be obtained according to the above steps S13241 to S13242, that is, I×I sub-pixel positions can be obtained.
  • It can be known that the size of the original frame event feature map of the s*-th scale is n times the size of the estimated frame event feature map of the (S-S*)th scale, and the I×I sub-pixel positions are determined based on the estimated frame event feature map of the (S-S*)th scale, that is, based on the estimated frame event feature map of the smallest scale.
  • Therefore, when adjusting the original frame event feature map of the s*-th scale according to the I×I sub-pixel positions, the original frame event feature map of the s*-th scale can be cropped based on the I×I sub-pixel positions to obtain I×I feature blocks of size n×n, and the I×I feature blocks of size n×n are size-spliced to obtain the integrated feature map of the s*-th scale.
  • In some implementations, in step S13243, adjusting the original frame event feature map of the s*-th scale according to the I×I sub-pixel positions to obtain the integrated feature map of the s*-th scale includes cropping the feature blocks centered on the sub-pixel positions, where each position on each feature block is a non-integer coordinate position, and the feature value at each position on each feature block can be obtained by linear interpolation (such as bilinear interpolation).
  • FIG. 3 shows a schematic diagram of an original frame event feature map according to an embodiment of the present disclosure.
  • In FIG. 3, j represents a sub-pixel position; assuming that n is 2, a feature block H_j of size 2×2 is cropped for the sub-pixel position j.
  • For position h1 on the feature block, bilinear interpolation can be performed on the feature values at the two surrounding pixel positions "a6, a7" (or on the feature values at the four surrounding pixel positions "a1, a2, a6, a7") to obtain the corresponding feature value at h1; for the other positions h2, h3 and h4, bilinear interpolation can likewise be performed on the feature values at their respective surrounding pixel positions to obtain their corresponding feature values.
  • That is, bilinear interpolation can be performed on the feature values at at least two pixel positions around each position on each feature block to obtain the feature value at that position.
  • Size-splicing the I×I feature blocks of size n×n can be understood as splicing the I×I feature blocks of size n×n in the size dimension (that is, the length and width dimensions) according to the I×I sub-pixel positions, so that the size of the integrated feature map of the s*-th scale is the same as that of the original frame event feature map of the s*-th scale.
  • the integrated feature map is a feature map combined with the attention mechanism , so that the integrated feature map contains feature information with higher attention.
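  • One way (an assumption, not necessarily this disclosure's implementation) to realize the bilinear sampling of an n×n feature block centered on a sub-pixel position is torch.nn.functional.grid_sample; the block offsets of ±0.5 pixels around the center used for n = 2 are likewise assumed.

```python
import torch
import torch.nn.functional as F

def crop_subpixel_block(feat, center_y, center_x, n=2):
    """feat: (C, H, W) original frame event feature map at scale s*.
    (center_y, center_x): sub-pixel position j (non-integer coordinates).
    Returns a (C, n, n) feature block sampled with bilinear interpolation."""
    C, H, W = feat.shape
    offs = torch.arange(n, dtype=torch.float32) - (n - 1) / 2.0   # offsets around the center
    ys = center_y + offs.view(n, 1).expand(n, n)
    xs = center_x + offs.view(1, n).expand(n, n)
    # Normalize pixel coordinates to [-1, 1] as required by grid_sample.
    grid = torch.stack([2.0 * xs / (W - 1) - 1.0,
                        2.0 * ys / (H - 1) - 1.0], dim=-1).unsqueeze(0)   # (1, n, n, 2)
    block = F.grid_sample(feat.unsqueeze(0), grid, mode="bilinear",
                          align_corners=True)                              # (1, C, n, n)
    return block.squeeze(0)
```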
  • As described above, a local distance field can be constructed with the pixel position j of the first matching pixel as the center, and the local distance field can be continuously fitted by a parameterized second-order polynomial whose global minimum has a closed-form solution.
  • In some implementations, step S13242, determining the sub-pixel position corresponding to the pixel position according to the pixel position of the first matching pixel and the specified offset, includes:
  • determining an objective function according to a preset offset parameter and preset surface parameters, where the objective function is constructed according to the difference between a surface function and a distance function, the distance function is constructed according to the pixel position and the offset parameter, and the surface function is constructed according to the surface parameters and the offset parameter;
  • minimizing the objective function within the preset value interval corresponding to the offset parameter to obtain the parameter values of the surface parameters, where the offset parameter is the independent variable of the objective function; determining the specified offset according to the parameter values of the surface parameters; and adding the pixel position to the specified offset to obtain the sub-pixel position. In this way, the sub-pixel position can be determined accurately and effectively.
  • In some implementations, the distance function d(u) can be expressed as formula (5), that is, the above-mentioned local distance field; the surface function can be expressed as formula (6), that is, the above-mentioned second-order polynomial; and the objective function can be expressed as formula (7).
  • In formulas (5) to (7), D(·) represents the Euclidean distance, u represents the offset parameter, and [-n, n]^2 represents the preset value interval; the value of n can be set according to actual needs (for example, 1), which is not limited by this embodiment of the present disclosure.
  • The preset value interval can be understood as a local window of (2n+1)^2 size sampled around the position j, that is, the preset value interval [-n, n]^2; in other words, the offset parameter, as the independent variable, takes values from the (2n+1)^2 local window when solving the objective function.
  • A, b and c represent the surface parameters, where A can be a 2×2 positive definite matrix, b is a 2×1 vector, and c is a bias constant; u^T represents the transpose of u, and b^T represents the transpose of b.
  • The offset parameter u can be a 2×1 vector, that is, it can include an offset parameter on the horizontal axis and an offset parameter on the vertical axis.
  • formula (6) is a quadratic surface function with a global minimum point.
  • In some implementations, the weighted least squares method can be used: according to the (2n+1)^2 known independent variables u and their corresponding distance function values d(u), the objective function (7) is minimized to solve for the parameter values of the surface parameters.
  • In formula (7), w(u) represents a Gaussian distribution function whose width is a constant parameter, exp represents the exponential function with the natural constant e as the base, the difference term inside the norm represents the difference between the surface function and the distance function, and ||·||^2 represents the square of the norm.
  • the above formula (7) can be understood as finding the surface function A,b,c with the smallest difference from the distance function d(u).
  • w(u) may also be replaced by other weight distribution functions, for example, Euclidean distance, cosine distance, etc., which are not limited in this embodiment of the present disclosure.
  • During the solution of the objective function, w(u) can be understood as a constant matrix. It is understandable that each independent variable u is differentiable in the process of solving the objective function, and the second-order polynomial (that is, quadratic surface) fitting process can therefore be embedded in neural network training as a differentiable layer.
  • In order to make the estimated A a positive definite matrix, the off-diagonal elements of A can be set to 0 and only the diagonal elements optimized; if a diagonal element is negative, the function max(0, ·) can be used to set it to 0. In this way, the amount of calculation can be reduced and the element values of the matrix A can be obtained quickly.
  • In some implementations, the local distance field shown in formula (5), that is, the distance function, is fitted by the surface function, where the surface parameters include a first parameter (such as A above) and a second parameter (such as b above); the first parameter is a 2×2 matrix and the second parameter is a 2×1 vector.
  • The parameter value of the first parameter includes the two first element values on the diagonal of the matrix, and the parameter value of the second parameter includes the two second element values of the vector; that is, the parameter values of the surface parameters include two first element values and two second element values.
  • In this case, determining the specified offset according to the parameter values of the surface parameters includes: determining a vertical-axis offset and a horizontal-axis offset according to the two first element values and the two second element values, the specified offset including the vertical-axis offset and the horizontal-axis offset. In this way, the horizontal-axis offset and the vertical-axis offset can be obtained effectively.
  • In some implementations, formula (8) can be used to determine the vertical-axis offset and the horizontal-axis offset, where u* represents the specified offset; A(0,0) and A(1,1) represent the two first element values on the diagonal of the matrix, A(0,0) being the upper-left element and A(1,1) the lower-right element of the diagonal; b(0) and b(1) represent the two second element values of the vector, b(0) being the first element value and b(1) the second element value; a very small constant is added to each denominator to ensure the numerical stability of the division, that is, to keep the denominator from being 0; and the two components of u* represent the horizontal-axis offset and the vertical-axis offset.
  • the position of the sub-pixel can be determined accurately and effectively, so that the integrated feature map can be obtained based on the position of the sub-pixel.
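  • Because formulas (5) to (8) are not reproduced in this text, the sketch below only assumes the quadratic form s(u) = u^T A u + b^T u + c with a diagonal A, fits it to the sampled local distance field d(u) by Gaussian-weighted least squares, and returns the per-axis minimizer; the constant factors may therefore differ from the disclosure's formula (8).

```python
import torch

def subpixel_offset(dist_field, n=1, sigma=1.0, eps=1e-6):
    """dist_field: (2n+1, 2n+1) local distance field d(u), u in [-n, n]^2, centered
    on the first matching pixel j. Returns the fractional offset minimizing the
    fitted quadratic surface (diagonal A, negative diagonal entries clamped to 0)."""
    us = torch.arange(-n, n + 1, dtype=torch.float32)
    uy, ux = torch.meshgrid(us, us, indexing="ij")
    uy, ux, d = uy.reshape(-1), ux.reshape(-1), dist_field.reshape(-1)
    w = torch.exp(-(ux ** 2 + uy ** 2) / sigma ** 2)          # Gaussian weights w(u)
    # Design matrix for the parameters [a_xx, a_yy, b_x, b_y, c].
    X = torch.stack([ux ** 2, uy ** 2, ux, uy, torch.ones_like(ux)], dim=1)
    sw = w.sqrt().unsqueeze(1)                                # weighted least squares
    params = torch.linalg.lstsq(X * sw, d.unsqueeze(1) * sw).solution.squeeze(1)
    a_xx, a_yy = params[0].clamp(min=0), params[1].clamp(min=0)   # max(0, .) on the diagonal
    b_x, b_y = params[2], params[3]
    off_x = -b_x / (2 * a_xx + eps)        # horizontal-axis offset (eps keeps denominator > 0)
    off_y = -b_y / (2 * a_yy + eps)        # vertical-axis offset
    return off_y, off_x                    # specified offset u* (fractional)
```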
  • In some implementations, the initial frame to be inserted is usually determined based on two original video frames that are temporally adjacent to the initial frame to be inserted, that is, the original video frame may include at least two frames; accordingly, at least two integrated feature maps of the s*-th scale are obtained through steps S13241 to S13243 of the above-mentioned embodiments of the present disclosure.
  • In this case, optimizing the estimated frame to be inserted to obtain the target frame to be inserted includes:
  • Step S13251: according to the estimated frame event feature map of the s*-th scale and the at least two integrated feature maps of the s*-th scale, determine a target integrated feature map of the s*-th scale.
  • the integrated feature maps of each s * th scale can be obtained by referring to steps S13241 to S13243 in the above-mentioned embodiment of the present disclosure, and details are not described here.
• The similarity between the estimated frame event feature map of the s*-th scale and each integrated feature map of the s*-th scale can be calculated, and the integrated feature map of the s*-th scale with the largest similarity is determined as the target integrated feature map of the s*-th scale.
  • a Euclidean distance or a cosine distance between two feature maps may be used to characterize the similarity between the two feature maps.
• Using the integrated feature map of the s*-th scale with the largest similarity as the target integrated feature map of the s*-th scale means that, from the at least two integrated feature maps of the s*-th scale, the integrated feature map most similar to the estimated frame event feature map of the s*-th scale is selected as the target integrated feature map of the s*-th scale. In this way, the target integrated feature map that is closest to the estimated frame event feature map at each scale can be quickly determined.
  • Step S13252 According to the target integration feature map, estimated frame event feature map, and fusion feature map of S * scales, optimize the estimated frame to be inserted to obtain the target frame to be inserted.
• The estimated frame event feature map can be multi-scale, and the fusion feature map can be obtained by performing multi-scale fusion of the initial frame feature map and the event feature map through steps S1311 to S1313 in the above-mentioned embodiment of the present disclosure; that is, the fusion feature map can also be multi-scale. It should be understood that the target integration feature map, the estimated frame event feature map and the fusion feature map of the same scale have the same size.
  • the estimated frame to be inserted is optimized to obtain the target frame to be inserted according to the target integration feature map, the estimated frame event feature map, and the fusion feature map of S * scales, including:
• Step S132521: According to the target integration feature map of the (S-S*)-th scale, the estimated frame event feature map of the (S-S*)-th scale, and the fusion feature map of the (S-S*)-th scale, obtain the target fusion feature map of the (S-S*)-th scale.
• In a possible implementation manner, obtaining the target fusion feature map of the (S-S*)-th scale includes: extracting, through a residual network, the residual feature of the estimated frame event feature map of the (S-S*)-th scale to obtain the residual feature map of the (S-S*)-th scale, where the network structure of the residual network is not limited in this embodiment of the present disclosure; channel-splicing the residual feature map of the (S-S*)-th scale, the target integration feature map of the (S-S*)-th scale and the fusion feature map of the (S-S*)-th scale to obtain the target splicing feature map; and filtering the target splicing feature map to obtain the target fusion feature map of the (S-S*)-th scale. In this way, the target fusion feature map of the (S-S*)-th scale can be effectively obtained.
• For example, the target splicing feature map can be filtered through a convolution layer with 1×1 convolution kernels to obtain the target fusion feature map of the (S-S*)-th scale, where the number of convolution kernels in the convolution layer is the same as the number of channels of the target integration feature map of the (S-S*)-th scale.
• It should be understood that the target fusion feature map of the (S-S*)-th scale is the target fusion feature map of the smallest scale, and its size is the same as that of the target integration feature map of the (S-S*)-th scale.
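As an illustration of the channel-splice-and-filter step just described, a minimal PyTorch sketch might look like the following; the 1×1 convolution whose output channel count equals that of the target integration feature map follows the description above, while the module name and the absence of an activation are assumptions.

```python
import torch
import torch.nn as nn

class SmallestScaleTargetFusion(nn.Module):
    """Hypothetical sketch: channel-splice the residual feature map, the target
    integration feature map and the fusion feature map of the smallest scale,
    then filter with 1x1 convolution kernels back to the original channel count."""

    def __init__(self, channels):
        super().__init__()
        # the number of 1x1 kernels equals the channels of the target integration feature map
        self.filter = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, residual, target_integrated, fused):
        spliced = torch.cat([residual, target_integrated, fused], dim=1)  # target splicing feature map
        return self.filter(spliced)                                       # target fusion feature map
```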
• Step S132522: Perform feature fusion on the target fusion feature map of the (s*-1)-th scale, the target integration feature map of the s*-th scale, and the fusion feature map of the s*-th scale, to obtain the target fusion feature map of the s*-th scale.
• For example, the target fusion feature map of the (s*-1)-th scale can be up-sampled to obtain a target up-sampled feature map; convolution processing and nonlinear processing can be performed on the target up-sampled feature map to obtain a target mask map corresponding to the target up-sampled feature map; and, according to the target mask map, feature fusion can be performed on the target integration feature map of the s*-th scale and the fusion feature map of the s*-th scale to obtain the target fusion feature map of the s*-th scale.
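A minimal sketch of this mask-guided fusion is shown below, assuming a sigmoid as the nonlinear processing and a blend of the two maps with the mask and its complement (consistent with the Hadamard-product fusion described elsewhere in this disclosure); the module and variable names are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskGuidedFusion(nn.Module):
    """Hypothetical sketch: fuse the s*-th scale target integration feature map with
    the s*-th scale fusion feature map under a target mask map derived from the
    upsampled (s*-1)-th scale target fusion feature map."""

    def __init__(self, channels):
        super().__init__()
        self.mask_conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, prev_target_fusion, target_integrated, fused):
        up = F.interpolate(prev_target_fusion, scale_factor=2,
                           mode='bilinear', align_corners=False)    # target up-sampled feature map
        mask = torch.sigmoid(self.mask_conv(up))                     # convolution + nonlinear processing
        return mask * target_integrated + (1.0 - mask) * fused       # adaptive feature fusion
```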
  • Step S132523 Extract the residual feature in the s * th scale target fusion feature map to obtain the s * th scale residual feature map.
  • the residual feature in the target fusion feature map of the s * th scale can be extracted through the residual network to obtain the residual feature map of the s * th scale. It should be understood that the embodiments of the present disclosure do not limit the network structure of the residual network.
• Step S132524: Decode the residual feature map of the (S-1)-th scale to obtain the decoded residual information.
• For example, the residual feature map of the (S-1)-th scale can be decoded through a specified decoding network to obtain the decoded residual information.
  • the network structure of the specified decoding network may correspond to the above-mentioned multi-layer convolutional layer used for extracting the original frame event feature map and the estimated frame event feature map, that is, the above-mentioned multi-layer convolutional layer can be understood as an encoding network.
  • the embodiments of the present disclosure do not limit the network structure of the residual network and the designated decoding network.
  • the residual information representing image details in the target fusion feature map can be extracted, and then the image quality of the target frame to be inserted obtained by superimposing the estimated frame to be inserted and the residual information is higher.
  • Step S132525 Superimpose the residual information on the estimated frame to be inserted to obtain the target frame to be inserted.
• Since the residual information is extracted from the residual feature map, the residual information can also be in the form of a "map". Based on this, superimposing the residual information on the estimated frame to be inserted can be understood as performing image fusion of the residual information and the estimated frame to be inserted.
  • an image fusion technology known in the art may be used, for example, weighted average of pixel values at the same position, or superposition of pixel values, etc., which are not limited in this embodiment of the present disclosure.
• In this way, the target fusion feature map is obtained according to the target integration feature map that has a higher similarity to the estimated frame event feature map, the estimated frame event feature map itself, and the fusion feature map, and the residual information representing image details in the target fusion feature map is extracted, so that the image quality of the target frame to be inserted obtained by superimposing the estimated frame to be inserted and the residual information is higher.
• It should be noted that each pixel in any frame to be interpolated can usually find a best-matching pixel in the two adjacent original video frames before and after the frame to be interpolated; in other words, some pixels in any frame to be interpolated may best match the pixel at the same position in the previous adjacent original video frame, while other pixels may best match the pixel at the same position in the subsequent adjacent original video frame.
• In a possible implementation manner, step S13251, determining the target integrated feature map of the s*-th scale according to the feature similarity between the estimated frame event feature map of the s*-th scale and the at least two integrated feature maps of the s*-th scale, includes: for any second pixel point in the estimated frame event feature map of the s*-th scale, determining, from the at least two integrated feature maps of the s*-th scale, a target matching pixel point that matches the second pixel point; and generating the target integrated feature map of the s*-th scale according to the feature information at each target matching pixel point that matches a second pixel point.
• Since the integrated feature maps of the s*-th scale include at least two, target matching pixel points that match each second pixel point can be determined, so as to obtain the target integrated feature map of the s*-th scale that best matches the estimated frame event feature map of the s*-th scale.
• The feature information includes the feature value at each target matching pixel point. The target integrated feature map of the s*-th scale can be generated according to the feature information at each target matching pixel point that matches a second pixel point; for example, according to the pixel position of each second pixel point, the feature value at each corresponding target matching pixel point can be added to a blank feature map with the same size as the integrated feature map of the s*-th scale, so as to generate the target integrated feature map of the s*-th scale.
• In a possible implementation manner, determining, from the at least two integrated feature maps of the s*-th scale, the target matching pixel point that matches the second pixel point includes: for any integrated feature map of the s*-th scale, determining, from that integrated feature map, a second matching pixel point that matches the second pixel point according to the feature similarity between the second pixel point and each pixel point in the integrated feature map of the s*-th scale; and determining the second matching pixel point with the largest feature similarity among the at least two second matching pixel points as the target matching pixel point that matches the second pixel point.
• With reference to step S13241 in the above-mentioned embodiment of the present disclosure, the second matching pixel point that matches the second pixel point can be determined from the integrated feature map of the s*-th scale according to the feature similarity between the second pixel point and each pixel point in the integrated feature map of the s*-th scale, which will not be described in detail here.
• Determining the second matching pixel point that matches the second pixel point in the integrated feature map of the s*-th scale may include: determining, from the integrated feature map of the s*-th scale, the second matching pixel point that matches the second pixel point according to the feature similarity between the second pixel point and each pixel point within a specified window in the integrated feature map of the s*-th scale.
  • Euclidean distance, cosine distance, etc. may be used to calculate the feature similarity between pixels, which is not limited in this embodiment of the present disclosure.
• The above specified window can be, for example, a local window of size (2m+1)² centered at the pixel position of each second pixel point, where m can be set according to actual needs, for example, to 3, which is not limited by the embodiments of the present disclosure. In this way, the range for searching target matching pixel points can be narrowed, the amount of computation can be reduced, and the efficiency of determining target matching pixel points can be improved.
• Determining the second matching pixel point with the largest feature similarity among the at least two second matching pixel points as the target matching pixel point can be understood as follows: for a certain second pixel point, first determine the second matching pixel point that matches that pixel point from each integrated feature map of the s*-th scale; then, according to the feature similarity corresponding to each second matching pixel point, determine the second matching pixel point with the largest feature similarity (that is, the smallest Euclidean distance or cosine distance) as the target matching pixel point that matches the second pixel point.
• Formula (9) shows a method of determining the target integrated feature map of the s*-th scale according to an embodiment of the present disclosure, where i* represents the pixel position of any second pixel point in the estimated frame event feature map of the s*-th scale, and the remaining symbols respectively represent the pixel positions of the second matching pixel points on the two integrated feature maps of the s*-th scale, the feature value at the second pixel point, and the feature values at the second matching pixel points on the two integrated feature maps of the s*-th scale. Formula (9) can be understood as follows: based on the Euclidean distance between each of the two integrated feature maps of the s*-th scale and the estimated frame event feature map of the s*-th scale, the feature value with the smaller Euclidean distance is selected from the two integrated feature maps of the s*-th scale as the feature value on the target integrated feature map of the s*-th scale.
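The per-pixel hard selection expressed by formula (9) can be sketched as follows (an illustration only: the feature maps are assumed to be [C, H, W] tensors that are already spatially aligned, the distance is Euclidean, and the function name is hypothetical).

```python
import torch

def select_target_integrated(query, integ_a, integ_b):
    """Hypothetical sketch in the spirit of formula (9): at every pixel, keep the feature
    vector from whichever of the two s*-th scale integrated feature maps has the smaller
    Euclidean distance to the estimated frame event feature map (the query)."""
    dist_a = torch.norm(integ_a - query, dim=0)       # [H, W] per-pixel distance to map A
    dist_b = torch.norm(integ_b - query, dim=0)       # [H, W] per-pixel distance to map B
    keep_a = (dist_a <= dist_b).unsqueeze(0)          # [1, H, W] hard selection mask
    return torch.where(keep_a, integ_a, integ_b)      # target integrated feature map
```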
• In this way, since the integrated feature maps of the s*-th scale include at least two, the target matching pixel points that match each second pixel point can be determined, so as to obtain the target integrated feature map of the s*-th scale that best matches the estimated frame event feature map of the s*-th scale.
  • the first event information may be determined according to the event signal collected by the event camera, and the event signal may represent the collection point where the brightness of the object photographed by the event camera changes, and the degree of brightness change within a certain time interval.
  • the initial frame to be inserted corresponding to the video to be processed and the first event information corresponding to the initial frame to be inserted are obtained, including:
  • Step S111 According to the specified frame insertion time and the original video frame adjacent to the frame insertion time in the original video frame, an initial frame to be inserted is generated, and the video to be processed is collected by the event camera;
  • Step S112 According to the event signal collected by the event camera in the time interval corresponding to the frame insertion moment, determine the first event information.
• The event signal is used to represent the collection points at which the brightness of the object captured by the event camera changes, and the degree of brightness change within the time interval.
  • At least one frame to be inserted can be inserted between any two original video frames, and the user can specify at least one frame interpolation moment between two original video frames, so as to calculate The optical flow from any two original video frames to each frame insertion moment, and according to the optical flow, render the original video frames through forward rendering (that is, forward mapping) to obtain the initial frame to be inserted.
  • the time interval corresponding to the frame insertion moment can be understood as the time window where the frame insertion moment is located.
• For example, the time interval corresponding to any frame insertion moment t can be (t-δ, t+δ), where δ can be, for example, half or one third of the duration between the two original video frames adjacent to the frame insertion moment, and can be determined according to the frame rate of the video to be inserted, which is not limited by this embodiment of the present disclosure.
• Assuming the frame insertion moment is t, where t can be a normalized fractional time, the event signals collected in the time window (t-δ, t+δ) where the frame insertion moment is located can be accumulated to obtain the first event information.
  • the first event information can record the cumulative value of the event signals collected in the above time interval in the form of a "graph". In this way, the event feature map in the first event information can be extracted later.
  • the initial frame to be inserted and the first event information corresponding to the initial frame to be inserted can be effectively obtained.
  • the event signal collected at the frame insertion moment of the initial frame to be inserted can be converted into a multi-channel tensor, that is, the first event information is obtained.
• In a possible implementation manner, determining the first event information according to the event signal collected by the event camera in the time interval corresponding to the frame insertion moment includes:
  • Step S1121 Divide the event signals collected in the time interval into M groups of event signals, where M is a positive integer.
  • the event camera will generate a series of microsecond-level event signals, which can be output in the form of event streams. Based on this, it can be understood that there are multiple event signals collected in the time interval corresponding to the frame insertion moment.
  • the value of M can be set according to actual requirements, the network structure of the feature extraction network, etc., for example, it can be set to 20, which is not limited in this embodiment of the present disclosure.
  • Step S1122 For the mth group of event signals, according to the preset signal filtering interval, filter out the event signals outside the signal filtering interval from the mth group of event signals, and obtain the mth group of target event signals, m ⁇ [1,M ].
• The signal filtering interval may be a preset signal interval for filtering abnormal event signals; for example, the signal filtering interval may be set to [-10,10]. The signal filtering interval may be set based on historical experience, inherent parameters of the event camera, and the like, which is not limited in this embodiment of the present disclosure.
  • the abnormal event signal can be understood as an event signal collected under abnormal conditions (such as a sudden increase in the brightness of ambient light, etc.).
• The value of an abnormal event signal will be too large or too small, and event information containing abnormal event signals may not accurately represent the motion trajectory of the object. In this way, the M groups of target event signals include effective, normal event signals, so that the first event information generated based on the M groups of target event signals can accurately represent the motion trajectory of the object.
• Step S1123: According to the polarity and signal position of each target event signal in the m-th group of target event signals, accumulate the target event signals at the same signal position to obtain the m-th sub-event information, where the signal position is used to represent the coordinate position, in the imaging plane of the event camera, of the collection point corresponding to the target event signal, and the first event information includes M pieces of sub-event information.
  • the event signal collected by the event camera has polarity, that is, there are negative numbers and positive numbers in the event signal.
  • the event camera can collect event signals and video signals at the same time.
• Since the event signal represents the collection points where the brightness of the object captured by the event camera changes and the degree of brightness change within the time interval, each collection point where the brightness changes is mapped to a corresponding coordinate position in the imaging plane of the event camera.
• Accumulating the target event signals at the same signal position to obtain the m-th sub-event information can be understood as follows: the target event signals are aggregated and accumulated according to their respective polarities and signal positions to obtain the m-th sub-event information.
• As mentioned above, the first event information can record the cumulative value of the event signals collected in the above time interval in the form of a "graph"; the m-th sub-event information can then be understood as the m-th channel of the first event information, and the first event information can be a graph of M channels, that is, a tensor of M channels.
  • the event signal collected in the time interval corresponding to the frame insertion time can be effectively converted into multi-channel first event information, so as to facilitate subsequent extraction of the event feature map of the first event information.
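To make the conversion concrete, here is a minimal NumPy-style sketch; it assumes each event is a tuple (x, y, polarity, timestamp) with polarity ±1, M = 20 groups, and a filtering interval of [-10, 10]. Whether the clipping is applied to individual signals or to the accumulated values, as done here, is an assumption of this sketch, and the function name and array layout are hypothetical.

```python
import numpy as np

def events_to_first_event_info(events, t, delta, height, width, M=20, clip=(-10, 10)):
    """Convert the event stream inside the window (t - delta, t + delta) into
    M-channel first event information by accumulating polarities per pixel."""
    info = np.zeros((M, height, width), dtype=np.float32)
    t_start, t_end = t - delta, t + delta
    for x, y, polarity, ts in events:
        if not (t_start <= ts < t_end):
            continue                                        # outside the time interval
        m = int((ts - t_start) / (t_end - t_start) * M)     # equally spaced group index
        m = min(m, M - 1)
        info[m, y, x] += polarity                           # accumulate by polarity at the signal position
    return np.clip(info, clip[0], clip[1])                  # filter abnormal (too large/small) values
```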
• FIG. 4 shows a schematic diagram of an image processing network according to an embodiment of the present disclosure. As shown in FIG. 4, the image processing network includes a complementary information fusion network and a sub-pixel motion attention network, where the complementary information fusion network includes a dual-branch feature extraction sub-network (that is, the two UNets in FIG. 4) and a multi-scale adaptive fusion sub-network (that is, the AAFB in FIG. 4).
• In a possible implementation manner, step S12, performing feature extraction on the initial frame to be inserted and the first event information respectively to obtain the initial frame feature map corresponding to the initial frame to be inserted and the event feature map corresponding to the first event information, includes: performing, through the dual-branch feature extraction sub-network, feature extraction on the initial frames to be inserted (I 0→1 and I 0→2) and the first event information (E 1) to obtain the initial frame feature map and the event feature map.
• Each branch of the dual-branch feature extraction network can use a UNet, and each UNet can include 5 sets of convolutional layers: the first set of convolutional layers retains the resolution of the input data, while each of the other sets downsamples the input feature map to 1/2 of the original size in the length and width dimensions, and the 5 sets of convolutional layers expand the number of feature channels to 32, 64, 128, 256, and 256, respectively.
  • the network structure of the above two-branch feature extraction network is an implementation method provided by the embodiments of the present disclosure. In fact, those skilled in the art can design the network structure of the two-branch feature extraction network according to requirements. For the two-branch feature The network structure of the extraction network is not limited in the embodiments of the present disclosure.
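Purely as an illustration of such an encoder configuration (not the network actually used by the disclosure), a PyTorch sketch of one branch might look like the following; the kernel size, stride-2 downsampling and activation choice are assumptions.

```python
import torch.nn as nn

def make_branch_encoder(in_channels):
    """Hypothetical 5-stage encoder for one branch of the dual-branch feature
    extraction sub-network: the first stage keeps the input resolution, the other
    stages halve the spatial size, with channel counts 32, 64, 128, 256, 256."""
    channels = [32, 64, 128, 256, 256]
    stages, prev = [], in_channels
    for i, ch in enumerate(channels):
        stride = 1 if i == 0 else 2                       # only the first stage keeps the resolution
        stages.append(nn.Sequential(
            nn.Conv2d(prev, ch, kernel_size=3, stride=stride, padding=1),
            nn.LeakyReLU(0.1, inplace=True),
        ))
        prev = ch
    return nn.ModuleList(stages)                          # yields multi-scale features f 0 ... f 4
```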
• The initial frame feature map f s is a feature map of 5 scales, where f s represents the initial frame feature map of the s-th scale; the event feature map e s is likewise a feature map of 5 scales, where e s represents the event feature map of the s-th scale, that is, s ∈ {0,1,2,3,4}. For example, f 0 represents the initial frame feature map of the 0th scale, e 0 represents the event feature map of the 0th scale, and X 0 represents the fusion feature map of the 0th scale; f 1 to f 4, e 1 to e 4, and X 1 to X 4 can be understood by analogy and are not described in detail here.
• In a possible implementation manner, step S131, generating an estimated frame to be inserted according to the initial frame feature map and the event feature map, includes: fusing the initial frame feature map f s and the event feature map e s through the multi-scale adaptive fusion sub-network to generate the estimated frame to be inserted. In this manner, the estimated frame to be inserted can be generated quickly and accurately.
  • step S132 according to the original video frame adjacent to the initial frame to be inserted and the second event information corresponding to the original video frame, the estimated frame to be inserted is optimized to obtain the target frame to be inserted Frame, including: through the sub-pixel motion attention network, optimize the estimated frame to be inserted according to the original video frame adjacent to the initial frame to be inserted and the second event information corresponding to the original video frame, and obtain the target frame to be inserted .
  • the estimated frame to be inserted can be accurately optimized, and a target frame to be inserted with higher image quality can be obtained.
• In FIG. 4, I 0 and I 2 represent the original video frames adjacent to the frame insertion moment of the initial frame to be inserted, E 0 and E 2 represent the second event information corresponding to the original video frames (I 0 and I 2), <I 0, E 0> and <I 2, E 2> represent the two pieces of original frame event combination information, and the combination of the estimated frame to be inserted with the first event information represents the estimated frame event combination information.
  • the sub-pixel motion attention network may include a feature extraction sub-network.
• Feature extraction is performed on the estimated frame event combination information and the original frame event combination information through the feature extraction sub-network, to obtain the estimated frame event feature map corresponding to the estimated frame event combination information and the original frame event feature maps corresponding to the original frame event combination information.
• The feature extraction sub-network can include three convolutional layers with parameter sharing, and the extracted feature maps can be feature maps of 3 scales, s* ∈ {2,3,4}.
  • the sub-pixel motion attention network may include a sub-pixel attention sub-network and a sub-pixel integration sub-network.
• Through the above sub-networks, the original frame event feature maps are adjusted to obtain the integrated feature maps.
• Step S13251 can be implemented through the sub-pixel integration sub-network: according to the feature similarity between the estimated frame event feature map of the s*-th scale and the at least two integrated feature maps of the s*-th scale, determine the target integrated feature map of the s*-th scale.
• The sub-pixel motion attention network may further include a multi-scale adaptive fusion sub-network AAFB, a residual network and a decoding network (not shown in FIG. 4). In a possible implementation manner, in step S132521, the residual feature of the estimated frame event feature map of the (S-S*)-th scale can be extracted through the residual network to obtain the residual feature map of the (S-S*)-th scale (R 2 in FIG. 4 represents the residual feature map of the 2nd scale); then the residual feature map of the (S-S*)-th scale (such as R 2), the target integration feature map of the (S-S*)-th scale and the fusion feature map of the (S-S*)-th scale (such as X 2) are channel-spliced and filtered to obtain the target fusion feature map of the (S-S*)-th scale.
• Step S132522 can be implemented through the multi-scale adaptive fusion sub-network AAFB: perform feature fusion on the target fusion feature map of the (s*-1)-th scale, the target integration feature map of the s*-th scale, and the fusion feature map of the s*-th scale, to obtain the target fusion feature map of the s*-th scale.
• Step S132523 can be implemented through the residual network: extract the residual feature in the target fusion feature map of the s*-th scale to obtain the residual feature map of the s*-th scale, where R 3 represents the residual feature map of the 3rd scale and R 4 represents the residual feature map of the 4th scale.
• Step S132524 can be implemented through the decoding network: decode the residual feature map of the (S-1)-th scale (such as R 4) to obtain the decoded residual information R s. In step S132525, the residual information R s is superimposed on the estimated frame to be inserted to obtain the target frame to be inserted.
• It should be understood that the image processing network shown in FIG. 4 is one implementation provided by the embodiments of the present disclosure. In fact, those skilled in the art can design, according to actual needs, an image processing network for implementing the video frame insertion method of the embodiments of the present disclosure, and the image processing network is not limited in this embodiment of the present disclosure.
  • the target frame to be inserted can be accurately and efficiently generated through the image processing network.
• In a possible implementation manner, the method further includes: training an initial image processing network according to a preset sample video to obtain the image processing network, where the sample video includes sample intermediate frames and sample video frames adjacent to the sample intermediate frames.
• The sample intermediate frame may be an intermediate video frame between two sample video frames in the sample video; that is, the sample intermediate frame is also an original video frame in the sample video.
• In a possible implementation manner, training the initial image processing network to obtain the image processing network includes: generating an initial intermediate frame according to the intermediate moment corresponding to the sample intermediate frame and the sample video frames; inputting the sample video frames and the initial intermediate frame into the initial image processing network to obtain a predicted intermediate frame output by the initial image processing network; and, according to the loss between the predicted intermediate frame and the sample intermediate frame, updating the network parameters of the initial image processing network until the loss meets the preset condition, to obtain the image processing network.
• Step S111 in the above-mentioned embodiment of the present disclosure can be referred to in order to generate the initial intermediate frame according to the intermediate moment corresponding to the sample intermediate frame and the sample video frames; that is, the optical flow from the sample video frames to the intermediate moment is calculated through an optical flow estimation algorithm known in the art, and the sample video frames are rendered by means of forward rendering (that is, forward mapping) according to the optical flow to obtain the initial intermediate frame.
  • inputting the sample video frame and the initial intermediate frame into the initial image processing network to obtain the predicted intermediate frame output by the initial image processing network can refer to the implementation of generating the target frame to be inserted through the image processing network in the above-mentioned embodiments of the present disclosure. The process will not be repeated here.
• A loss function known in the art, such as the Charbonnier loss, can be used to calculate the loss between the predicted intermediate frame and the sample intermediate frame, which is not limited by the embodiments of this disclosure.
  • the preset condition may include, for example: loss converges, loss is set to 0, and the number of iterations reaches a specified number, etc., which are not limited by this embodiment of the present disclosure.
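For reference, a common formulation of the Charbonnier loss is L(x, y) = mean(sqrt((x − y)² + ε²)); the sketch below assumes this standard form and an ε of 1e-6, neither of which is fixed by the disclosure.

```python
import torch

def charbonnier_loss(pred, target, eps=1e-6):
    """Standard Charbonnier (smooth L1-like) loss between the predicted
    intermediate frame and the sample intermediate frame."""
    return torch.mean(torch.sqrt((pred - target) ** 2 + eps ** 2))
```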
  • the trained image processing network can accurately and efficiently generate the target frame to be inserted.
  • the image processing network includes a complementary information fusion network and a sub-pixel motion attention network.
• During training, the complementary information fusion network can be trained first; after the loss of the complementary information fusion network converges, the network parameters of the complementary information fusion network are fixed, and the sub-pixel motion attention network is then trained.
• In a possible implementation manner, the initial image processing network includes an initial complementary information fusion network and an initial sub-pixel motion attention network, and the predicted intermediate frame includes a first predicted intermediate frame output by the initial complementary information fusion network and a second predicted intermediate frame output by the initial sub-pixel motion attention network. According to the loss, updating the network parameters of the initial image processing network until the loss meets the preset conditions to obtain the image processing network includes: updating the network parameters of the initial complementary information fusion network according to a first loss between the first predicted intermediate frame and the sample intermediate frame until the first loss converges, to obtain the complementary information fusion network; inputting the sample predicted intermediate frame output by the complementary information fusion network into the initial sub-pixel motion attention network to obtain the second predicted intermediate frame; and updating the network parameters of the initial sub-pixel motion attention network according to a second loss between the second predicted intermediate frame and the sample intermediate frame until the second loss converges, to obtain the sub-pixel motion attention network.
  • the above-mentioned training process of the initial image processing network can be understood as a two-stage network training.
  • the initial complementary information fusion network is first trained, and after the first loss of the initial complementary information fusion network converges, the network parameters of the initial complementary information fusion network are fixed to obtain a complementary information fusion network.
• Then, the sample predicted intermediate frame output by the trained complementary information fusion network is used as the input data of the initial sub-pixel motion attention network, and the second predicted intermediate frame output by the initial sub-pixel motion attention network is obtained.
  • the network parameters of the initial sub-pixel motion attention network are updated until the second loss converges, and a trained sub-pixel motion attention network is obtained.
  • the image processing network can be trained in stages to improve the training efficiency of the image processing network.
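A minimal two-stage training loop might look like the following PyTorch sketch, reusing the charbonnier_loss sketch above; the model interfaces (fusion_net, attention_net), the optimizer settings and the data layout are hypothetical and only illustrate fixing the first-stage parameters before training the second stage.

```python
import torch

def train_two_stage(fusion_net, attention_net, loader, epochs=10, lr=1e-4):
    """Stage 1: train the complementary information fusion network.
    Stage 2: fix its parameters and train the sub-pixel motion attention network."""
    opt1 = torch.optim.Adam(fusion_net.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, initial_mid, sample_mid, events in loader:
            first_pred = fusion_net(frames, initial_mid, events)
            loss1 = charbonnier_loss(first_pred, sample_mid)          # first loss
            opt1.zero_grad(); loss1.backward(); opt1.step()

    for p in fusion_net.parameters():                                 # fix stage-1 network parameters
        p.requires_grad_(False)

    opt2 = torch.optim.Adam(attention_net.parameters(), lr=lr)
    for _ in range(epochs):
        for frames, initial_mid, sample_mid, events in loader:
            with torch.no_grad():
                first_pred = fusion_net(frames, initial_mid, events)  # sample predicted intermediate frame
            second_pred = attention_net(first_pred, frames, events)
            loss2 = charbonnier_loss(second_pred, sample_mid)         # second loss
            opt2.zero_grad(); loss2.backward(); opt2.step()
    return fusion_net, attention_net
```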
  • scale of the feature map in the embodiment of the present disclosure can be understood as the feature map extracted at different levels of the neural network, or in other words, the scale is used to distinguish the feature maps extracted from different levels of networks.
  • Size can be understood as the length, width and height of feature maps of different scales, or the resolution of feature maps of different scales. It should be understood that the size of feature maps at different scales may be different, and the size of feature maps at the same scale may be the same.
  • An embodiment of the present disclosure provides a video frame interpolation method, which includes: a complementary information fusion stage and a sub-pixel attention image quality enhancement stage.
  • the purpose of the embodiment of the present disclosure is to synthesize two original video frames at any frame interpolation moment of t ⁇ (0,1) and insert an intermediate frame where t is a normalized fractional moment.
  • Relevant event information is available within a localized time window
• In the complementary information fusion phase, firstly, using the calculated optical flow, the pixels in the two original video frames are moved to positions aligned with the video frame at the frame insertion moment. This process outputs 2 rough initial frames to be interpolated, in which obvious errors can be observed wherever the optical flow estimation is inaccurate.
• The event information related to the frame insertion moment can then be used to mine complementary motion trajectory information to correct these errors.
• Specifically, the embodiment of the present disclosure uses two UNets (any suitable multi-scale feature extraction network can be used) to extract the features of the event information and of the video signal respectively, then fuses the two extracted features through an adaptive appearance complementary fusion network (AAFB in FIG. 4), and finally outputs the optimized estimated frame to be inserted.
  • the embodiment of the present disclosure uses an attention mechanism to optimize the predicted frame to be inserted in the second stage.
• The combined information of the estimated frame to be inserted and its corresponding event information can be used as the query information, and the combined information of the adjacent original video frames and their corresponding event information can be used as the key values; the query information and the key value information are then matched more accurately through the attention mechanism of sub-pixel precision.
• In this way, the key value information related to each piece of query information can be retrieved more accurately, and the sub-pixel-precision image block displacement method is used to aggregate the related content, finally outputting multi-scale context features (that is, the above-mentioned integrated feature maps); then the context features and the multi-scale features generated in the complementary information fusion stage are further fused by AAFB, and the further optimized target frame to be inserted is output after processing by several residual networks.
• In the appearance complementary information fusion stage, optical flow estimation algorithms known in the art can be used to calculate the optical flow from each of the two original video frames to the frame insertion moment, and, according to the optical flow, the initial frames to be inserted are obtained by rendering the original video frames through forward rendering and are used as one input of the dual-branch feature extraction network.
• Since the event signal is time-dense, the embodiments of the present disclosure aggregate the event signals at equal time intervals into 20-channel event information as the other input of the dual-branch feature extraction network.
  • the dual-branch feature extraction network can be a dual-branch Unet.
  • an embodiment of the present disclosure proposes a multi-scale adaptive aggregation network (AAFB in Figure 4 ), which can effectively aggregate the features of the video signal and the features of the event signal at a multi-scale level.
  • the multi-scale adaptive aggregation network proposed by the embodiment of the present disclosure is a gradual aggregation process from coarse to fine scale.
• The fusion features of each scale can be represented recursively by formula (1).
  • f s and e s can be regarded as different perspective expressions of the same latent reconstruction information.
  • the embodiments of the present disclosure draw on the idea of re-normalization in related technologies, so that features expressed from different perspectives can be aligned in the same space while maintaining fine-grained spatial details.
• Specifically, two sets of independent convolutional layers can be used to learn spatially varying scales and offsets c f, b f and c e, b e, and then the respective features can be transformed into the fusionable feature maps y f and y e through the above formulas (2-1) and (2-2).
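As an illustration of such a re-normalization step, a PyTorch sketch is given below; what the two convolutional layers take as input, the normalization statistics, and the exact form y = c·x̂ + b are assumptions of this sketch rather than the formulas (2-1)/(2-2) themselves.

```python
import torch
import torch.nn as nn

class SpatialRenorm(nn.Module):
    """Hypothetical sketch of the re-normalization idea behind formulas (2-1)/(2-2):
    learn spatially varying scale c and offset b with two independent convolutional
    layers and use them to map one branch feature (video or event) into a fusable space."""

    def __init__(self, channels):
        super().__init__()
        self.to_scale = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.to_offset = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x):
        c = self.to_scale(x)                            # spatially varying scale (c_f or c_e)
        b = self.to_offset(x)                           # spatially varying offset (b_f or b_e)
        mean = x.mean(dim=(2, 3), keepdim=True)
        std = x.std(dim=(2, 3), keepdim=True) + 1e-5
        return c * ((x - mean) / std) + b               # fusionable feature map y_f or y_e
```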
• The event signal has good perception ability for the boundaries of moving objects, because such motion often causes rapid brightness changes in the image, whereas the optical flow estimated from the pure video signal is often unreliable in these areas.
• In areas without motion, however, the event information captured by the event camera is not as reliable as the information extracted from the video signal.
• Therefore, a fusion soft mask m can be extracted from the upsampled feature map corresponding to the fusion feature map of the (s-1)-th scale through a convolutional layer and a sigmoid layer, and the mask m is used to adaptively fuse the two kinds of complementary information; this process can refer to the above formula (3).
• Formulas (2-1), (2-2) and (3) form a recursive fusion process. Since the fusion process is an affine transformation, in order to increase the nonlinearity of each multi-scale adaptive fusion network, a 3×3 convolution operation and a LeakyReLU activation function are inserted at the output of each network, and all of the operations mentioned above together form an AAFB network.
  • the embodiments of the present disclosure adopt a lightweight attention mechanism to capture context information, so as to further optimize the image quality effect of the frame to be interpolated.
  • the input of the sub-pixel motion attention stage is mainly the combined information of video signal and event information
• The combination information is input into a convolutional network with 3 parameter-sharing layers, so as to output features of 3 scales.
  • the traditional attention mechanism often aggregates information through a soft attention mechanism.
• Softmax normalizes this correlation matrix, and then all of the position information in the "value" is aggregated by weighted summation; for image synthesis tasks, this can blur the aggregated features and degrade the quality of the final composition.
• The embodiment of the present disclosure therefore adopts a hard attention mechanism, because the hard attention mechanism records the position of the best match (that is, the highest similarity), namely the position of the "key" with the smallest Euclidean distance from a given feature vector in the "query".
• Sub-pixel-accurate attention offsets can be computed on low-resolution feature maps, then scaled up and applied to high-resolution feature maps; this method can alleviate the accuracy loss to a certain extent.
• The distances between a "query" vector and the "key" vectors at offsets p ∈ [-m, m]² can be organized into (2m+1)² distances, where p* is the offset at which the minimum distance lies.
• The local distance field centered at p* can be continuously fitted by a parameterized second-order polynomial whose global minimum has a closed-form solution.
  • the shape of the local distance field can be corrected and estimates obtained with sub-pixel accuracy.
• The original frame event feature map of the s*-th scale is n times the size of the smallest-scale estimated frame event feature map in the length and width dimensions.
• An image block of n×n size can be cropped, by bilinear interpolation, from the original frame event feature map of the s*-th scale centered at j*; the multiple image blocks are then spliced by position to obtain an information-reorganized integrated feature map with the same size as the original frame event feature map of the s*-th scale.
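The bilinear cropping of one n×n block at a fractional center can be sketched with torch.nn.functional.grid_sample as follows; the normalization of coordinates to [-1, 1] and the variable names are assumptions of this illustration.

```python
import torch
import torch.nn.functional as F

def crop_patch_bilinear(feat, center_xy, n):
    """Crop an n x n patch from feat (shape [1, C, H, W]) centered at the
    sub-pixel position center_xy = (x, y), using bilinear interpolation."""
    _, _, H, W = feat.shape
    cx, cy = center_xy
    half = (n - 1) / 2.0
    xs = torch.arange(n, dtype=torch.float32) - half + cx        # absolute x coordinates
    ys = torch.arange(n, dtype=torch.float32) - half + cy        # absolute y coordinates
    grid_y, grid_x = torch.meshgrid(ys, xs, indexing='ij')
    # grid_sample expects coordinates normalized to [-1, 1], x before y
    grid = torch.stack((2 * grid_x / (W - 1) - 1,
                        2 * grid_y / (H - 1) - 1), dim=-1).unsqueeze(0)
    return F.grid_sample(feat, grid, mode='bilinear', align_corners=True)  # [1, C, n, n]
```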
• The strategy of sub-pixel fitting and image-block movement can be applied to the two original frame event feature maps at the same time, resulting in two reorganized integrated feature maps; then, referring to formula (9), the feature with the smallest distance is preferentially retained from the two integrated feature maps according to the distances between features, so as to generate the target integrated feature map.
• In this way, a multi-scale target integration feature map can be obtained, and the fusion feature map output in the complementary information fusion stage and the target integration feature map can then be integrated through the above-mentioned multi-scale adaptive fusion network. The highest-resolution feature map after integration finally passes through a decoder, outputting the optimization residual R 1 of the target frame to be interpolated, which is superimposed on the estimated frame to be interpolated to obtain the target frame to be interpolated.
• During specific implementation, the local time window (t-δ, t+δ) can be divided into 20 equally spaced groups, where δ represents half of the time interval between two consecutive frames. The event signals falling in the same group are aggregated at each pixel position according to their own polarity, the value range is clipped to [-10,10], and finally a 20-channel tensor is formed, that is, the first event information is obtained.
• A dual-branch UNet can be used as the dual-branch feature extraction network.
• The UNet of each branch has 4 scales; the encoder at each scale passes through a set of convolutional layers, expanding the number of feature channels to 32, 64, 128, 256, and 256, where the first set of convolutional layers retains the input resolution while each of the other sets downsamples the feature map to 1/2 of the original size.
  • the decoder adopts a symmetrical structure design and performs skip connections with the corresponding encoder features. After multi-scale feature fusion, the highest-resolution feature layer passes through two 32-channel convolutional layers to produce the final output.
  • the first is the complementary information fusion stage.
• In this stage, feature extraction and complementary fusion are performed on the event signal related to the frame insertion moment and on the two nearest-neighbor original video frames to the left and right of the frame insertion moment, so as to synthesize a preliminary estimated frame to be interpolated. After that comes an image quality enhancement stage based on sub-pixel motion attention.
• In this stage, the synthesized estimated frame to be interpolated and the event signal related to it, together with the two nearest-neighbor left and right original video frames and their related event signals, are used again for a second stage of optimization, so as to obtain a target frame to be interpolated with fewer artificial traces and better image quality.
  • a certain number of video frame insertion processes between two original video frames can be realized.
  • the event signal collected by the event camera and the low frame rate video signal can be used to synthesize the target frame to be interpolated, so as to perform video frame interpolation and obtain a high frame rate video signal.
• Specifically, the embodiment of the present disclosure first moves the pixels of the two original frames around the frame insertion moment through the optical flow estimation algorithm to obtain the initial frames to be inserted, which are used as the input of the video signal feature extraction network; the event signal related to the initial frame to be inserted is then extracted and used as the input of the event signal feature extraction network.
• Then, two multi-scale feature extraction networks with independent parameters are used to extract the features of the video signal and the event signal respectively, obtaining two multi-scale feature maps; a multi-scale adaptive information fusion network is then used to fuse the two multi-scale feature maps, and the final fused feature map is passed through a decoder to output a preliminarily synthesized 3-channel color estimated frame to be inserted.
• In the image quality enhancement stage, the embodiment of the present disclosure combines the estimated frame to be inserted synthesized in the complementary information fusion stage and the left and right original video frames around the frame insertion moment with their respective related event signals, and the same feature extraction network is used to extract the features of the three groups of signals and output multi-scale features.
• The embodiment of the present disclosure applies the attention mechanism on the feature maps of the lowest scale: the feature map corresponding to the estimated frame to be interpolated is used as the query, and the feature maps corresponding to the other two original video frames are used as the key values. The feature position most relevant to the feature at each spatial position of the estimated frame to be interpolated is extracted through the hard attention mechanism; a quadratic surface is then fitted to the local distance field around that feature, and the extremum of the quadratic surface is used to find the most similar position with sub-pixel precision. The information corresponding to the two keys is re-integrated by bilinear interpolation, this integration strategy is scaled up proportionally and applied to the features of the other scales, and, by retaining the maximum similarity, the two pieces of integrated information are finally fused into one piece of multi-scale information.
• The integrated multi-scale information, the low-scale information corresponding to the estimated frame to be inserted, and the information extracted in the complementary information fusion stage are again subjected to feature fusion and decoding through multi-scale adaptive fusion, finally yielding the residual information, with which the target frame to be inserted with better image quality is obtained.
• In this way, the image quality of the initial frame to be interpolated is directly corrected through the motion trajectory information represented by the event information, and a more accurate attention mechanism is provided; by more accurately retrieving and utilizing motion-related context information to improve the quality of the estimated frames to be interpolated, better generalization is achieved.
  • the embodiment of the present disclosure proposes a method for complementary fusion of video signals and event signals.
• In this way, the event signal makes up for the motion trajectory information that is missing when estimating the motion of objects in the frame to be inserted from the video signal alone, while the video signal, which completely records non-moving areas, makes up for the information the event signal lacks in those areas.
• In addition, the embodiment of the present disclosure proposes a sub-pixel-precision motion attention mechanism, which can extract, on the low-resolution feature map, sub-pixel-precision attention that is sensitive to object motion, so that attention information can be obtained directly for the high-resolution feature map, thereby constructing a more accurate attention mechanism and improving image quality by more accurately retrieving and utilizing motion-related context information.
  • the training method using the unsupervised image processing network is more in line with the actual use scene of the event camera, reduces the requirement for training data, and improves the generalization of network training.
• The low-frame-rate video signal captured by the event camera and the event signal of the corresponding scene can be used to synthesize a high-frame-rate video signal of the scene; image processing tasks such as slow-motion playback, improving the video frame rate (fluency), and stabilizing the image (electronic image stabilization, video anti-shake) can also be completed.
  • the video frame insertion method according to the embodiments of the present disclosure can be applied to any product constructed with an event camera that requires a video frame insertion function, such as video playback software, or slow-motion playback of video security software.
• It is conceivable that this disclosure also provides a video frame insertion device, electronic equipment, a computer-readable storage medium, and programs, all of which can be used to implement any video frame insertion method provided by this disclosure; for the corresponding technical solutions and descriptions, refer to the corresponding records in the method section, which are not repeated here.
  • Fig. 5 shows a block diagram of a video frame insertion device according to an embodiment of the present disclosure. As shown in Fig. 5, the device includes:
• the obtaining module 101 is configured to obtain an initial frame to be inserted corresponding to the video to be processed, and first event information corresponding to the initial frame to be inserted, where the first event information is configured to represent the motion track of an object in the initial frame to be inserted;
• the feature extraction module 102 is configured to perform feature extraction on the initial frame to be inserted and the first event information respectively, to obtain an initial frame feature map corresponding to the initial frame to be inserted and an event feature map corresponding to the first event information;
  • the generating module 103 is configured to generate a target frame to be inserted according to the initial frame feature map and the event feature map;
  • the frame insertion module 104 is configured to insert the target frame to be inserted into the video to be processed to obtain a processed video.
  • the generation module includes: an estimated frame generation submodule configured to generate an estimated frame to be inserted according to the initial frame feature map and the event feature map; the estimated frame optimization The submodule is configured to, according to the original video frame in the video to be processed, adjacent to the frame insertion time of the initial frame to be inserted, and the second event information corresponding to the original video frame, the estimated The frame interpolation is optimized to obtain the target frame to be interpolated, and the second event information is configured to represent the motion track of the object in the original video frame.
• In a possible implementation manner, the initial frame feature map includes S scales, the event feature map includes S scales, and S is a positive integer; the estimated frame generation submodule includes: a first fusion unit configured to obtain the fusion feature map of the 0th scale according to the initial frame feature map of the 0th scale and the event feature map of the 0th scale; an alignment unit configured to spatially align the initial frame feature map of the s-th scale with the event feature map of the s-th scale based on the fusion feature map of the (s-1)-th scale, to obtain the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale; a second fusion unit configured to obtain the fusion feature map of the s-th scale according to the fusion feature map of the (s-1)-th scale, the fusionable initial frame feature map of the s-th scale, and the fusionable event feature map of the s-th scale; and a decoding unit configured to decode the fusion feature map to obtain the estimated frame to be inserted.
• In a possible implementation manner, the alignment unit includes: an upsampling subunit configured to upsample the fusion feature map of the (s-1)-th scale to obtain an upsampled feature map, where the upsampled feature map has the same size as the initial frame feature map of the s-th scale and the event feature map of the s-th scale; a first conversion subunit configured to obtain the fusionable initial frame feature map of the s-th scale according to a first spatial conversion relationship between the upsampled feature map and the initial frame feature map of the s-th scale; and a second conversion subunit configured to obtain the fusionable event feature map of the s-th scale according to a second spatial conversion relationship between the upsampled feature map and the event feature map of the s-th scale; wherein the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale are in the same feature space as the upsampled feature map.
• In a possible implementation manner, the first spatial conversion relationship is determined according to first pixel size scaling information and first offset information used when spatially converting the initial frame feature map of the s-th scale, and the feature information of the upsampled feature map; the second spatial conversion relationship is determined according to second pixel size scaling information and second offset information used when spatially converting the event feature map of the s-th scale, and the feature information of the upsampled feature map; wherein the pixel size scaling information represents the size scaling ratio of each pixel during spatial conversion, and the offset information represents the position offset of each pixel during spatial conversion.
• In a possible implementation manner, the second fusion unit includes: a processing subunit configured to perform convolution processing and nonlinear processing on the upsampled feature map to obtain a mask map corresponding to the upsampled feature map, where the upsampled feature map is obtained by upsampling the fusion feature map of the (s-1)-th scale; and a fusion subunit configured to, according to the mask map, perform feature fusion on the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale to obtain the fusion feature map of the s-th scale.
• In a possible implementation manner, the fusion subunit includes: a first fusion circuit configured to perform feature fusion on the fusionable initial frame feature map of the s-th scale and the fusionable event feature map of the s-th scale to obtain the initial fusion feature map of the s-th scale; and a processing circuit configured to perform convolution processing and nonlinear processing on the initial fusion feature map of the s-th scale to obtain the fusion feature map of the s-th scale.
  • the first fusion circuit is configured to: calculate the Hadamard product between the mask map and the feature map of the fusionable event at the sth scale; according to the mask map For the corresponding reverse mask map, calculate the product between the reverse mask map and the fusionable initial frame feature map of the sth scale; add the Hadamard product to the product to obtain the The initial fused feature map at the s-th scale.
• In a possible implementation manner, the first fusion unit includes: a splicing subunit configured to perform channel splicing on the initial frame feature map of the 0th scale and the event feature map of the 0th scale to obtain a spliced feature map; and a filtering subunit configured to perform filtering processing on the spliced feature map to obtain the fusion feature map of the 0th scale.
• In a possible implementation manner, the estimated frame optimization submodule includes: a first combination unit configured to combine the estimated frame to be inserted with the first event information to obtain estimated frame event combination information; a second combination unit configured to combine the original video frame with the second event information to obtain original frame event combination information; an extraction unit configured to perform feature extraction on the estimated frame event combination information and the original frame event combination information respectively, to obtain an estimated frame event feature map corresponding to the estimated frame event combination information and an original frame event feature map corresponding to the original frame event combination information; an adjustment unit configured to adjust the original frame event feature map according to the estimated frame event feature map to obtain an integrated feature map; and an optimization unit configured to optimize the estimated frame to be inserted according to the integrated feature map, the estimated frame event feature map and the fusion feature map to obtain the target frame to be inserted, where the fusion feature map is obtained by multi-scale fusion of the initial frame feature map and the event feature map.
  • the estimated frame event feature map includes S* scales and the original frame event feature map includes S* scales, where 1 ≤ S* ≤ S, S* is a positive integer, s* ∈ [S-S*, S), and the size of the estimated frame event feature map of the (S-S*)-th scale is I×I, where I is a positive integer
  • the adjustment unit includes: a first determination subunit configured to, for any first pixel point in the estimated frame event feature map of the (S-S*)-th scale, determine, from the original frame event feature map of the (S-S*)-th scale, a first matching pixel point that matches the first pixel point; a second determination subunit configured to determine a sub-pixel position corresponding to the pixel position according to the pixel position of the first matching pixel point and a specified offset, the specified offset being a decimal; and an adjustment subunit configured to adjust the original frame event feature map of the s*-th scale according to the I×I sub-pixel positions to obtain the integrated feature map of the s*-th scale.
  • the first determination subunit includes: a calculation circuit configured to, for any first pixel point, calculate the feature similarity between the first pixel point and each pixel point within a specified window in the original frame event feature map of the (S-S*)-th scale, the specified window being determined according to the pixel position of the first pixel point; and a first determination circuit configured to determine, among the pixel points within the specified window, the pixel point corresponding to the maximum feature similarity as the first matching pixel point.
  • the second determination subunit includes: a second determination circuit configured to determine an objective function according to the pixel position, a preset offset parameter and preset surface parameters; a solving circuit configured to minimize the objective function over the preset value range corresponding to the offset parameter to obtain the parameter values of the surface parameters, wherein the offset parameter is the independent variable in the objective function; a third determination circuit configured to determine the specified offset according to the parameter values of the surface parameters; and an addition circuit configured to add the pixel position to the specified offset to obtain the sub-pixel position.
  • the objective function is constructed according to the difference between a surface function and a distance function
  • the distance function is constructed according to the pixel position and the offset parameter
  • the surface function is constructed from the surface parameters and the offset parameters.
  • the surface parameters include a first parameter and a second parameter, the first parameter is a 2×2 matrix, the second parameter is a 2×1 vector, the parameter value of the first parameter includes the two first element values on the diagonal of the matrix, and the parameter value of the second parameter includes the two second element values in the vector; wherein the third determination circuit is configured to determine a vertical-axis offset and a horizontal-axis offset according to the two first element values and the two second element values, the specified offset including the vertical-axis offset and the horizontal-axis offset.
  • the size of the original frame event feature map of the s*-th scale is n times that of the estimated frame event feature map of the (S-S*)-th scale
  • the adjustment subunit includes: a cropping circuit configured to crop, centered on each of the sub-pixel positions, I×I feature blocks of n×n size from the original frame event feature map of the s*-th scale; and a splicing circuit configured to perform size splicing on the I×I feature blocks of n×n size according to the I×I sub-pixel positions to obtain the integrated feature map of the s*-th scale, the integrated feature map of the s*-th scale having the same size as the original frame event feature map of the s*-th scale.
  • the original video frame includes at least two frames, and the integrated feature map of the s*-th scale includes at least two integrated feature maps
  • the optimization unit includes: a third determination subunit configured to determine a target integration feature map of the s*-th scale according to the estimated frame event feature map of the s*-th scale and the at least two integrated feature maps of the s*-th scale; and an optimization subunit configured to optimize the estimated frame to be inserted according to the target integration feature maps of the S* scales, the estimated frame event feature map and the fusion feature map, to obtain the target frame to be inserted.
  • the third determination subunit includes: a fourth determination circuit configured to, for any second pixel point in the estimated frame event feature map of the s*-th scale, determine, from the at least two integrated feature maps of the s*-th scale, a target matching pixel point that matches the second pixel point; and a generating circuit configured to generate the target integration feature map of the s*-th scale according to the feature information at each target matching pixel point that matches a second pixel point.
  • the fourth determination circuit is configured to: for any integrated feature map of the s*-th scale, determine, from the integrated feature map of the s*-th scale, a second matching pixel point that matches the second pixel point according to the feature similarity between the second pixel point and each pixel point in the integrated feature map of the s*-th scale; and determine, according to the feature similarity corresponding to each of the at least two second matching pixel points, the second matching pixel point with the largest feature similarity among the at least two second matching pixel points as the target matching pixel point that matches the second pixel point.
  • the optimization subunit includes: a second fusion circuit configured to obtain a target fusion feature map of the (S-S*)-th scale according to the target integration feature map of the (S-S*)-th scale, the estimated frame event feature map of the (S-S*)-th scale and the fusion feature map of the (S-S*)-th scale; a third fusion circuit configured to perform feature fusion on the target fusion feature map of the (s*-1)-th scale, the target integration feature map of the s*-th scale and the fusion feature map of the s*-th scale to obtain the target fusion feature map of the s*-th scale; an extraction circuit configured to extract the residual features in the target fusion feature map of the s*-th scale to obtain a residual feature map of the s*-th scale; a decoding circuit configured to decode the residual feature map of the S-th scale to obtain decoded residual information; and a superposition circuit configured to superimpose the residual information onto the estimated frame to be inserted to obtain the target frame to be inserted.
  • the second fusion circuit is configured to: extract the residual features of the estimated frame event feature map of the (S-S*)-th scale to obtain a residual feature map of the (S-S*)-th scale; perform channel splicing on the residual feature map of the (S-S*)-th scale, the target integration feature map of the (S-S*)-th scale and the fusion feature map of the (S-S*)-th scale to obtain a target spliced feature map; and perform filtering processing on the target spliced feature map to obtain the target fusion feature map of the (S-S*)-th scale.
  • the acquisition module includes: an initial generation submodule configured to generate the initial frame to be inserted according to a specified frame insertion moment and the original video frame adjacent to the frame insertion moment in the video to be processed, the video to be processed being collected by an event camera; and an event information generation submodule configured to determine the first event information according to the event signals collected by the event camera within the time interval corresponding to the frame insertion moment, the event signals being used to characterize the collection points where the brightness of the object photographed by the event camera changes, and the degree of brightness change within the time interval.
  • the event information generation submodule includes: a division unit configured to divide the event signals collected in the time interval into M groups of event signals, where M is a positive integer; a screening unit , configured to filter out event signals outside the signal filtering interval from the mth group of event signals according to the preset signal filtering interval for the mth group of event signals, to obtain the mth group of target event signals, m ⁇ [1, M]; the accumulation unit is configured to accumulate the target event signals at the same signal position according to the polarity and signal position of each target event signal in the mth group of target event signals to obtain the mth sub-event information, the signal position is used to characterize the coordinate position of the acquisition point corresponding to the target event signal in the imaging plane of the event camera; wherein, the first event information includes M sub-event information.
  • the video frame insertion device is implemented through an image processing network
  • the image processing network includes a complementary information fusion network and a sub-pixel motion attention network
  • the complementary information fusion network includes a dual-branch feature extraction sub-network and a multi-scale adaptive fusion sub-network
  • the feature extraction module is configured to: perform, through the dual-branch feature extraction sub-network, feature extraction on the initial frame to be inserted and the first event information respectively, to obtain an initial frame feature map corresponding to the initial frame to be inserted and an event feature map corresponding to the first event information.
  • the estimated frame generation submodule is configured to generate, through the multi-scale adaptive fusion sub-network, an estimated frame to be inserted according to the initial frame feature map and the event feature map; and/or, the estimated frame optimization submodule is configured to optimize, through the sub-pixel motion attention network, the estimated frame to be inserted according to the original video frame adjacent to the initial frame to be inserted and the second event information corresponding to the original video frame, to obtain the target frame to be inserted.
  • the device further includes: a network training module configured to train an initial image processing network according to a sample video to obtain the image processing network, the sample video including a sample intermediate frame and sample video frames adjacent to the sample intermediate frame; wherein the network training module includes: an intermediate frame generation submodule configured to generate an initial intermediate frame according to the intermediate moment corresponding to the sample intermediate frame and the sample video frames; an input submodule configured to input the sample video frames and the initial intermediate frame into the initial image processing network to obtain a predicted intermediate frame output by the initial image processing network; and an update submodule configured to update the network parameters of the initial image processing network according to the loss between the predicted intermediate frame and the sample intermediate frame, until the loss satisfies a preset condition, to obtain the image processing network.
  • the initial image processing network includes an initial complementary information fusion network and an initial sub-pixel motion attention network, and the predicted intermediate frame includes: a first predicted intermediate frame output by the initial complementary information fusion network, and a second predicted intermediate frame output by the initial sub-pixel motion attention network
  • the update submodule includes: a first update unit configured to update the network parameters of the initial complementary information fusion network according to a first loss between the first predicted intermediate frame and the sample intermediate frame, until the first loss converges, to obtain the complementary information fusion network; an input unit configured to input the predicted intermediate frame output by the complementary information fusion network for the sample into the initial sub-pixel motion attention network to obtain the second predicted intermediate frame; and a second update unit configured to update the network parameters of the initial sub-pixel motion attention network according to a second loss between the second predicted intermediate frame and the sample intermediate frame, until the second loss converges, to obtain the sub-pixel motion attention network.
  • in the embodiments of the present disclosure, the first event information representing the motion trajectory of the object in the initial frame to be inserted can be used to optimize the initial frame to be inserted of the video to be processed, so that the image quality of the generated target frame to be inserted is higher than that of the initial frame to be inserted, thereby improving the picture quality of the processed video and helping to reduce jitter and distortion of the picture in the processed video.
  • the functions or modules included in the device provided by the embodiments of the present disclosure can be used to execute the methods described in the method embodiments above, and for the specific implementation, reference may be made to the description of the method embodiments above; for brevity, details are not repeated here.
  • Embodiments of the present disclosure also provide a computer-readable storage medium, on which computer program instructions are stored, and the above-mentioned method is implemented when the computer program instructions are executed by a processor.
  • Computer readable storage media may be volatile or nonvolatile computer readable storage media.
  • An embodiment of the present disclosure also proposes an electronic device, including: a processor; a memory configured to store instructions executable by the processor; wherein the processor is configured to invoke the instructions stored in the memory to execute the above method.
  • Embodiments of the present disclosure also provide a computer program, including computer-readable codes; when the computer-readable codes run in an electronic device, a processor in the electronic device executes the method of any one of the above embodiments.
  • An embodiment of the present disclosure also provides a computer program product, including computer-readable codes, or a non-volatile computer-readable storage medium carrying computer-readable codes; when the computer-readable codes run in a processor of an electronic device, the processor in the electronic device executes the above method.
  • Electronic devices may be provided as terminals, servers, or other forms of devices.
  • FIG. 6 shows a block diagram of an electronic device 800 according to an embodiment of the present disclosure.
  • the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
  • electronic device 800 may include one or more of the following components: processing component 802, memory 804, power supply component 806, multimedia component 808, audio component 810, input/output (Input/Output, I/O) interface 812 , sensor component 814, and communication component 816.
  • the processing component 802 generally controls the overall operations of the electronic device 800, such as those associated with display, telephone calls, data communications, camera operations, and recording operations.
  • the processing component 802 may include one or more processors 820 to execute instructions to complete all or part of the steps of the above method. Additionally, processing component 802 may include one or more modules that facilitate interaction between processing component 802 and other components. For example, processing component 802 may include a multimedia module to facilitate interaction between multimedia component 808 and processing component 802 .
  • the memory 804 is configured to store various types of data to support operations at the electronic device 800 . Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and the like.
  • the memory 804 can be realized by any type of volatile or non-volatile storage device or their combination, such as Static Random-Access Memory (Static Random-Access Memory, SRAM), Electrically Erasable Programmable Read-Only Memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), Erasable Programmable Read-Only Memory (Erasable Programmable Read-Only Memory, EPROM), Programmable Read-Only Memory (Programmable Read-Only Memory, PROM), Read Only Memory (Read Only Memory, ROM), magnetic memory, flash memory, magnetic disk or optical disk.
  • the power supply component 806 provides power to various components of the electronic device 800 .
  • Power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for electronic device 800 .
  • the multimedia component 808 includes a screen providing an output interface between the electronic device 800 and the user.
  • the screen may include a liquid crystal display (Liquid Crystal Display, LCD) and a touch panel (TouchPanel, TP).
  • the screen may be implemented as a touch screen to receive an input signal from a user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense a boundary of a touch or swipe action, but also detect duration and pressure associated with the touch or swipe action.
  • the multimedia component 808 includes a front camera and/or a rear camera. When the electronic device 800 is in an operation mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera can be a fixed optical lens system or have focal length and optical zoom capability.
  • the audio component 810 is configured to output and/or input audio signals.
  • the audio component 810 includes a microphone (Micphone, MIC), which is configured to receive external audio signals when the electronic device 800 is in an operation mode, such as a calling mode, a recording mode and a voice recognition mode. Received audio signals may be stored in memory 804 or sent via communication component 816 .
  • the audio component 810 also includes a speaker configured to output audio signals.
  • the I/O interface 812 provides an interface between the processing component 802 and a peripheral interface module, which may be a keyboard, a click wheel, a button, and the like. These buttons may include, but are not limited to: a home button, volume buttons, start button, and lock button.
  • Sensor assembly 814 includes one or more sensors configured to provide various aspects of status assessment for electronic device 800 .
  • the sensor component 814 can detect the on/off state of the electronic device 800 and the relative positioning of components, such as the display and the keypad of the electronic device 800; the sensor component 814 can also detect a change in position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, the orientation or acceleration/deceleration of the electronic device 800, and temperature changes of the electronic device 800.
  • Sensor assembly 814 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact.
  • Sensor assembly 814 may also include a light sensor, such as a Complementary Metal Oxide Semiconductor (CMOS) or Charge-coupled Device (CCD) image sensor, configured for use in imaging applications.
  • the sensor component 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor or a temperature sensor.
  • the communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices.
  • the electronic device 800 can access a wireless network based on a communication standard, such as a wireless network (WiFi), a second generation mobile communication technology (2G) or a third generation mobile communication technology (3G), or a combination thereof.
  • the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component 816 also includes a Near Field Communication (NFC) module to facilitate short-range communication.
  • the NFC module can be based on radio frequency identification (Radio Frequency Identification, RFID) technology, infrared data association (infrared data association, IrDA) technology, ultra-wideband (Ultra Wide Band, UWB) technology, Bluetooth (BlueTooth, BT) technology and other technology to achieve.
  • the electronic device 800 may be implemented by one or more application-specific integrated circuits (Application Specific Integrated Circuit, ASIC), digital signal processors (Digital Signal Processing, DSP), digital signal processing devices (Digital Signal Processing Device , DSPD), Programmable Logic Device (Programmable Logic Device, PLD), Field Programmable Gate Array (Field Programmable Gate Array, FPGA), controller, microcontroller, microprocessor or other electronic components to implement the above method.
  • a non-volatile computer-readable storage medium such as the memory 804 including computer program instructions, which can be executed by the processor 820 of the electronic device 800 to implement the above method.
  • FIG. 7 shows a block diagram of an electronic device 1900 according to an embodiment of the present disclosure.
  • electronic device 1900 may be provided as a server.
  • electronic device 1900 includes processing component 1922 , which includes one or more processors, and a memory resource represented by memory 1932 configured to store instructions executable by processing component 1922 , such as application programs.
  • the application programs stored in memory 1932 may include one or more modules each corresponding to a set of instructions.
  • the processing component 1922 is configured to execute instructions to perform the above method.
  • Electronic device 1900 may also include a power supply component 1926 configured to perform power management of electronic device 1900, a wired or wireless network interface 1950 configured to connect electronic device 1900 to a network, and an input-output (I/O) interface 1958 .
  • the electronic device 1900 can operate based on an operating system stored in the memory 1932, such as the Microsoft server operating system (Windows Server™), the graphical user interface-based operating system (Mac OS X™) introduced by Apple Inc., the multi-user multi-process computer operating system (Unix™), the free and open-source Unix-like operating system (Linux™), the open-source Unix-like operating system (FreeBSD™), or the like.
  • a non-transitory computer-readable storage medium such as the memory 1932 including computer program instructions, which can be executed by the processing component 1922 of the electronic device 1900 to implement the above method.
  • the present disclosure can be a system, method and/or computer program product.
  • a computer program product may include a computer readable storage medium having computer readable program instructions thereon for causing a processor to implement various aspects of the present disclosure.
  • a computer readable storage medium may be a tangible device that can retain and store instructions for use by an instruction execution device.
  • a computer readable storage medium may be, for example, but is not limited to, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing.
  • Computer-readable storage media include: portable computer disks, hard disks, Random Access Memory (RAM), Read Only Memory (ROM), Erasable Programmable Memory Read memory (EPROM or flash memory), static random access memory (SRAM), portable compact disk read-only memory (Compact Disc Read-Only Memory CD-ROM), digital versatile disk (Digital Video Disc, DVD), memory stick, Floppy disks, mechanically encoded devices, such as punched cards or raised structures in grooves with instructions stored thereon, and any suitable combination of the foregoing.
  • computer-readable storage media are not to be construed as transient signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through waveguides or other transmission media (e.g., pulses of light through fiber optic cables), or transmitted electrical signals.
  • Computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or downloaded to an external computer or external storage device over a network, such as the Internet, a local area network, a wide area network, and/or a wireless network.
  • the network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers, and/or edge servers.
  • a network adapter card or a network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in each computing/processing device .
  • Computer program instructions for performing the operations of the present disclosure may be assembly instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state setting data, or in the form of one or more source or object code written in any combination of programming languages, including object-oriented programming languages—such as Smalltalk, C++, etc., and conventional procedural programming languages—such as the “C” language or similar programming languages.
  • Computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server implement.
  • the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • in some embodiments, electronic circuits, such as programmable logic circuits, field programmable gate arrays (FPGA) or programmable logic arrays (Programmable Logic Array, PLA), can be customized by using state information of the computer-readable program instructions, and the electronic circuits can execute the computer-readable program instructions to implement various aspects of the present disclosure.
  • these computer-readable program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, when executed by the processor of the computer or other programmable data processing apparatus, produce an apparatus for implementing the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • these computer-readable program instructions may also be stored in a computer-readable storage medium, and these instructions cause computers, programmable data processing devices and/or other devices to work in a specific way, so that the computer-readable medium storing the instructions includes an article of manufacture that includes instructions for implementing various aspects of the functions/actions specified in one or more blocks of the flowcharts and/or block diagrams.
  • each block in a flowchart or block diagram may represent a module, a program segment, or a portion of an instruction, which contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented by a dedicated hardware-based system that performs the specified function or action , or may be implemented by a combination of dedicated hardware and computer instructions.
  • the computer program product can be specifically realized by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a software development kit (Software Development Kit, SDK), and the like.
  • the embodiment of the present disclosure discloses a video frame insertion method, device, electronic equipment, storage medium, program and program product.
  • the method includes: obtaining an initial frame to be inserted corresponding to a video to be processed, and first event information corresponding to the initial frame to be inserted, the first event information being used to represent the motion trajectory of an object in the initial frame to be inserted; performing feature extraction on the initial frame to be inserted and the first event information respectively, to obtain an initial frame feature map corresponding to the initial frame to be inserted and an event feature map corresponding to the first event information; generating a target frame to be inserted according to the initial frame feature map and the event feature map; and inserting the target frame to be inserted into the video to be processed to obtain a processed video.
  • the embodiments of the present disclosure can improve the picture quality of the processed video.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Television Systems (AREA)
  • Image Analysis (AREA)

Abstract

本公开涉及一种视频插帧方法、装置、电子设备、存储介质、程序和程序产品,所述方法包括:获取待处理视频对应的初始待插帧,以及初始待插帧对应的第一事件信息,第一事件信息用于表征初始待插帧中物体的运动轨迹;分别对初始待插帧以及第一事件信息进行特征提取,得到初始待插帧对应的初始帧特征图以及第一事件信息对应的事件特征图;根据初始帧特征图与事件特征图,生成目标待插帧;将目标待插帧插入至待处理视频中,得到处理后视频。本公开实施例可实现提高处理后视频的画面质量。

Description

视频插帧方法、装置、电子设备、存储介质、程序及程序产品
相关申请的交叉引用
本公开要求2021年09月29日提交的中国专利申请号为202111154081.7、申请人为“深圳市慧鲤科技有限公司”,申请名称为“视频插帧方法及装置、电子设备和存储介质”的优先权,该申请的全文以引用的方式并入本申请中。
技术领域
本公开涉及计算机技术领域,尤其涉及一种视频插帧方法、装置、电子设备、存储介质、程序及程序产品。
背景技术
相关技术中,可以例如通过光流估计算法等视频插帧技术,对原始视频进行插帧,来提高原始视频的帧率。但通过相关视频插帧技术所生成的待插帧的图像质量不高,从而降低了插帧后视频的画面质量,例如使插帧后视频的画面产生抖动、扭曲等。
发明内容
本公开实施例至少提供一种视频插帧方法、装置、电子设备、存储介质、程序及程序产品。
根据本公开实施例的第一方面,提供了一种视频插帧方法,包括:获取待处理视频对应的初始待插帧,以及所述初始待插帧对应的第一事件信息,所述第一事件信息用于表征所述初始待插帧中物体的运动轨迹;分别对所述初始待插帧以及所述第一事件信息进行特征提取,得到所述初始待插帧对应的初始帧特征图以及所述第一事件信息对应的事件特征图;根据所述初始帧特征图与所述事件特征图,生成目标待插帧;将所述目标待插帧插入至所述待处理视频中,得到处理后视频。通过该方式,能够提高处理后视频的画面质量,有利于降低处理后视频中画面的抖动与扭曲等。
根据本公开实施例的第二方面,提供了一种视频插帧装置,包括:获取模块,配置为获取待处理视频对应的初始待插帧,以及所述初始待插帧对应的第一事件信息,所述第一事件信息用于表征所述初始待插帧中物体的运动轨迹;特征提取模块,配置为分别对所述初始待插帧以及所述第一事件信息进行特征提取,得到所述初始待插帧对应的初始帧特征图以及所述第一事件信息对应的事件特征图;生成模块,配置为根据所述初始帧特征图与所述事件特征图,生成目标待插帧;插帧模块,配置为将所述目标待插帧插入至所述待处理视频中,得到处理后视频。
根据本公开实施例的第三方面,提供了一种电子设备,包括:处理器;配置为存储处理器可执行指令的存储器;其中,所述处理器被配置为调用所述存储器存储的指令,以执行第一方面所述的方法。
根据本公开实施例的第四方面,提供了一种计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令被处理器执行时实现第一方面所述的方法。
根据本公开实施例的第五方面,提供了一种计算机程序,包括计算机可读代码,在所述计算机可读代码在电子设备中运行的情况下,所述电子设备中的处理器执行第一方面所述的方法。
根据本公开实施例的第六方面,提供了一种计算机程序产品,包括计算机可读代码,在所述计算机可读代码在电子设备中运行的情况下,所述电子设备中的处理器执行第一方面所述的方法。
在本公开实施例中,能够实现利用表征初始待插帧中物体的运动轨迹的第一事件信息,对待处理视频的初始待插帧进行优化,使生成的目标待插帧的图像质量高于初始待插帧,从而提高处理后视频的画面质量,有利于降低处理后视频中画面的抖动与扭曲等。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性和解释性的,而非限制本公开。根据下面参考附图对示例性实施例的详细说明,本公开的其它特征及方面将变得清楚。
附图说明
此处的附图被并入说明书中并构成本说明书的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。
图1示出根据本公开实施例的视频插帧方法的流程图。
图2示出根据本公开实施例的融合特征图生成流程的示意图。
图3示出根据本公开实施例的原始帧事件特征图的示意图。
图4示出根据本公开实施的一种图像处理网络的示意图。
图5示出根据本公开实施例的视频插帧装置的框图。
图6示出根据本公开实施例的一种电子设备的框图。
图7示出根据本公开实施例的一种电子设备的框图。
具体实施方式
以下将参考附图详细说明本公开的各种示例性实施例、特征和方面。附图中相同的附图标记表示功能相同或相似的元件。尽管在附图中示出了实施例的各种方面,但是除非特别指出,不必按比例绘制附图。
在这里专用的词“示例性”意为“用作例子、实施例或说明性”。这里作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。
本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中术语“至少一种”表示多种中的任意一种或多种中的至少两种的任意组合,例如,包括A、B、C中的至少一种,可以表示包括从A、B和C构成的集合中选择的任意一个或多个元素。
另外,为了更好地说明本公开,在下文的具体实施方式中给出了众多的具体细节。本领域技术人员应当理解,没有某些具体细节,本公开同样可以实施。在一些实例中,对于本领域技术人员熟知的方法、手段、元件和电路未作详细描述,以便于凸显本公开的主旨。
图1示出根据本公开实施例的视频插帧方法的流程图,所述视频插帧方法可以由终端设备或服务器等电子设备执行,终端设备可以为用户设备(User Equipment,UE)、移动设备、用户终端、终端、蜂窝电话、无绳电话、个人数字助理(Personal Digital Assistant,PDA)、手持设备、计算设备、车载设备、可穿戴设备等,所述方法可以通过处理器调用存储器中存储的计算机可读指令的方式来实现,或者,可通过服务器执行所述方法。如图1所示,所述视频插帧方法包括:
在步骤S11中,获取待处理视频对应的初始待插帧,以及初始待插帧对应的第一事件信息,第一事件信息用于表征初始待插帧中物体的运动轨迹。
其中,待处理视频可理解为待插入视频帧的低帧率视频。在一种可能的实现方式中,可以是通过本领域已知的光流估计算法,例如PWCNet算法、FlowNet算法等,计算待处理视频中任意两帧原始视频帧到插帧时刻的光流,并根据光流将原始视频帧通过前向渲染(也即前向映射)等方式,渲染得到初始待插帧。应理解的是,待处理视频帧中任意两帧原始视频帧中可以插入至少一帧初始待插帧,对于初始待插帧的数量以及生成方式,本公开实施例不作限制。
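下面给出一个前向映射（前向渲染）的极简代码草图，仅用于说明"根据光流把原始视频帧映射到插帧时刻"这一思路；其中采用最近邻溅射的简化做法，函数名 forward_warp、变量 flow_0_to_t 以及随机数据均为示例假设，并非本公开实施例的实际实现：

```python
import numpy as np

def forward_warp(frame: np.ndarray, flow: np.ndarray) -> np.ndarray:
    """将原始视频帧按光流前向映射（简化的最近邻溅射）。
    frame: (H, W, C)；flow: (H, W, 2)，flow[..., 0]/flow[..., 1] 分别为 x/y 方向上到插帧时刻的位移。"""
    H, W, _ = frame.shape
    warped = np.zeros_like(frame)
    ys, xs = np.mgrid[0:H, 0:W]
    # 目标坐标 = 源坐标 + 光流（此处四舍五入到整数像素，仅作示意）
    tx = np.round(xs + flow[..., 0]).astype(int)
    ty = np.round(ys + flow[..., 1]).astype(int)
    valid = (tx >= 0) & (tx < W) & (ty >= 0) & (ty < H)
    warped[ty[valid], tx[valid]] = frame[ys[valid], xs[valid]]
    return warped

# 示例：两帧原始视频帧到插帧时刻的光流可由 PWCNet、FlowNet 等算法得到，这里用随机数据代替
frame0 = np.random.rand(64, 64, 3).astype(np.float32)
flow_0_to_t = np.random.randn(64, 64, 2).astype(np.float32)
initial_frame = forward_warp(frame0, flow_0_to_t)   # 渲染得到的初始待插帧
```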
其中,第一事件信息可以是根据事件相机采集的事件信号确定的。事件相机的基本原理可以简单理解为:在某个采集点的亮度变化累计达到一定亮度阈值的情况下,输出一个事件信号,其中,该亮度阈值是事件相机的固有参数,事件信号可以表征事件相机所拍摄物体上亮度发生变化的采集点的亮度变化程度。
应理解的是,在事件相机所拍摄场景中的物体运动或光照改变造成亮度变化的情况下,事件相机会产生一系列微秒级的事件信号,这些事件信号可以事件流的方式输出,基于此,根据事件相机采集的事件流,可以得到任意秒级时刻下表征物体的运动轨迹的事件信息。
在一种可能的实现方式中,例如可以是将初始待插帧对应的插帧时刻处的事件信号进行累加,得到初始待插帧对应的第一事件信息,那么第一事件信息也即可以表征插帧时刻处物体的运动轨迹,第一事件信息可以采用“图”的形式记录上述插帧时刻处事件信号的累加值,通过该方式,可以之后便于提取第一事件信息中的事件特征图。
考虑到,为了便于得到初始待插帧对应的第一事件信息,待处理视频也可以是由事件相机采集的,也即,事件相机可以同时采集事件信号与视频信号,事件信号以事件流的形式输出,视频信号以视频流的形式输出。当然待处理视频也可以是其它类型相机(如单目相机)采集的,其它类型相机与事件相机可以同步对同一场景进行信号采集,对此本公开实施例不作限制。
在步骤S12中,分别对初始待插帧以及第一事件信息进行特征提取,得到初始待插帧对应的初始帧特征图以及第一事件信息对应的事件特征图。
在一种可能的实现方式中,可以采用本领域已知的特征提取网络,例如,可以采用Unet网络、AlexNet网络等,分别对初始待插帧进行特征提取,得到初始待插帧对应的初始帧特征图,以及对第一事件信号进行特征提取,得到第一事件信息对应的事件特征图。应理解的是,对于采用何种特征提取网络,本公开实施例不作限制。
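以下是一个双分支多尺度特征提取的示意性草图（PyTorch），仅用于说明"两个独立分支分别对初始待插帧与第一事件信息提取多尺度特征"的结构；其中的通道数、层数、模块名均为示例假设，并非专利中实际采用的网络结构：

```python
import torch
import torch.nn as nn

class MultiScaleEncoder(nn.Module):
    """简化的多尺度特征提取分支：每个尺度一组卷积，除第一组外逐级把分辨率下采样为 1/2。"""
    def __init__(self, in_ch: int, chs=(32, 64, 128, 256, 256)):
        super().__init__()
        layers, prev = [], in_ch
        for i, c in enumerate(chs):
            stride = 1 if i == 0 else 2          # 第一组保留分辨率，其余组下采样
            layers.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=stride, padding=1),
                nn.LeakyReLU(0.1, inplace=True)))
            prev = c
        self.stages = nn.ModuleList(layers)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)
        return feats                              # 从高分辨率到低分辨率的多尺度特征

# 双分支：一支处理初始待插帧（如3通道图像），一支处理第一事件信息（如M=20通道张量）
frame_branch = MultiScaleEncoder(in_ch=3)
event_branch = MultiScaleEncoder(in_ch=20)
f = frame_branch(torch.randn(1, 3, 128, 128))     # 初始帧特征图 f_s
e = event_branch(torch.randn(1, 20, 128, 128))    # 事件特征图 e_s
```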
在步骤S13中,根据初始帧特征图与事件特征图,生成目标待插帧。
可理解的是,通过步骤S12提取的初始帧特征图与事件特征图可以是多尺度的特征图,在一种 可能的实现方式中,根据初始帧特征图与事件特征图,生成目标待插帧,可以包括:通过本领域已知的多尺度特征融合网络(例如特征金字塔网络),对初始帧特征图与事件特征图进行多尺度特征融合,得到融合后特征图;进而通过解码网络对融合后特征图进行解码处理,得到目标待插帧。
其中,可理解的是,解码网络与上述特征提取网络的网络结构对应,上述特征提取网络也可以称为编码网络。通过该方式生成的目标待插帧,能够将事件特征图中表征物体运动轨迹的特征信息融合至初始帧特征图中,能够使生成的目标待插帧中物体显示的更清晰和更稳定,也即提高目标待插帧的图像质量。
在步骤S14中,将目标待插帧插入至待处理视频中,得到处理后视频。
在一种可能的实现方式中,将目标待插帧插入至待处理视频中,得到处理后视频,可以包括:根据初始待插帧对应的插帧时刻,将目标待插帧插入至待处理视频中,得到处理后视频,其中,处理后视频的帧率高于待处理视频,也即处理后视频可理解为高帧率视频。应理解的是,可以采用本领域已知计算机视觉技术,实现将目标待插帧插入至待处理视频中,对此本公开实施例不作限制。
在本公开实施例中,能够实现利用表征初始待插帧中物体的运动轨迹的第一事件信息,对待处理视频的初始待插帧进行优化,使生成的目标待插帧的图像质量高于初始待插帧,从而提高处理后视频的画面质量,有利于降低处理后视频中画面的抖动与扭曲等。
考虑到,通过上述本公开实施例中对初始待插帧以及第一事件信息进行特征提取以及多尺度特征融合的方式,可能使生成的目标待插帧中丢失原始视频帧中物体的部分细节信息,在一种可能的实现方式中,在步骤S13中,根据初始帧特征图与事件特征图,生成目标待插帧,包括:
步骤S131:根据初始帧特征图与事件特征图,生成预估待插帧;
如上所述,初始帧特征图与事件特征图可以是多尺度的,在一种可能的实现方式中,可以参照上述本公开实施例步骤S13中的相关记载,通过多尺度特征融合网络,对初始帧特征图与事件特征图进行多尺度特征融合,得到融合后特征图;进而通过解码网络对融合后特征图进行解码处理,得到预估待插帧。
步骤S132:根据待处理视频中、与初始待插帧的插帧时刻相邻的原始视频帧,以及原始视频帧对应的第二事件信息,对预估待插帧进行优化,得到目标待插帧,第二事件信息用于表征原始视频帧中物体的运动轨迹。
其中,待处理视频中与初始待插帧的插帧时刻相邻的原始视频帧,可以理解为,待处理视频中与插帧时刻在时序上相邻的原始视频帧。在一种可能的实现方式中,可以参照上述本公开实施例中第一事件信息的确定方式,得到原始视频帧对应的第二事件信息,也即,可以是将原始视频帧对应的采集时刻处的事件信号进行累加,得到原始视频帧对应的第二事件信息,那么第二事件信息也即可以表征原始视频帧对应的采集时刻处物体的运动轨迹。
在一种可能的实现方式中,根据上述原始视频帧以及第二事件信息,对预估待插帧进行优化,得到目标待插帧,例如可以包括:基于注意力机制,利用残差网络对原始视频帧与第二事件信息的组合信息进行残差特征提取,得到残差细节图,将残差细节图与预估待插帧的进行图像融合,得到目标待插帧。
在本公开实施例中,能够提取原始视频帧中物体的细节信息,将物体的细节信息融合至预估待插帧中,从而增强预估待插帧的图像质量,也即使目标待插帧具有更高的图像质量。
如上所述,初始帧特征图与事件特征图可以是多尺度的,在一种可能的实现方式中,初始帧特征图包括S个尺度,事件特征图包括S个尺度,S为正整数,s∈[1,S),其中,在步骤S131中,根据初始帧特征图与事件特征图,生成预估待插帧,包括:
步骤S1311:根据第0尺度的初始帧特征图与第0尺度的事件特征图,得到第0尺度的融合特征图。
其中,第0尺度的初始帧特征图与第0尺度的事件特征图,可以理解为,分别是初始帧特征图与事件特征图中最低尺度或者说最小尺寸、最小分辨率的特征图。
在一种可能的实现方式中,根据第0尺度的初始帧特征图与第0尺度的事件特征图,得到第0尺度的融合特征图,可以包括:将第0尺度的初始帧特征图与第0尺度的事件特征图进行通道拼接,得到拼接特征图;对拼接特征图进行滤波处理,得到第0尺度的融合特征图。通过该方式,可以便捷、有效地得到第0尺度的融合特征图。
其中,通道拼接,可以理解为,在特征图的通道维度上进行拼接,例如,128通道、16×16尺寸的两个特征图,通过通道拼接可以得到256通道、16×16尺寸的特征图。
在一种可能的实现方式中,可以通过卷积核为1×1尺寸的卷积层,对拼接特征图进行滤波处理, 得到第0尺度的融合特征图,其中,卷积层中卷积核的数量与第0尺度的初始帧特征图的通道数相同。
应理解的是,第0尺度的融合特征图的尺寸以及通道数,与第0尺度的事件特征图以及第0尺度的初始帧特征图相同,举例来说,假设拼接特征图为256通道、16×16尺寸的特征图,通过卷积核为128个1×1尺寸的卷积层,对该拼接特征图进行滤波处理,可以得到128通道、16×16尺寸的第0尺度的融合特征图。
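以下代码草图示意"通道拼接 + 1×1卷积滤波"得到第0尺度融合特征图的过程（PyTorch），张量形状沿用上文128通道、16×16尺寸的举例，变量名均为示例假设：

```python
import torch
import torch.nn as nn

# 第0尺度融合的示意实现：通道拼接后用 1×1 卷积滤波，使输出通道数与单个输入相同
f0 = torch.randn(1, 128, 16, 16)           # 第0尺度的初始帧特征图
e0 = torch.randn(1, 128, 16, 16)           # 第0尺度的事件特征图
concat = torch.cat([f0, e0], dim=1)         # 通道拼接 -> (1, 256, 16, 16)
fuse_1x1 = nn.Conv2d(256, 128, kernel_size=1)
x0 = fuse_1x1(concat)                       # 第0尺度的融合特征图 -> (1, 128, 16, 16)
```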
步骤S1312:根据第(s-1)尺度的融合特征图,将第s尺度的初始帧特征图与第s尺度的事件特征图进行空间对齐,得到第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图。
考虑到,初始帧特征图与事件特征图可以理解为对物体的不同视角表达,或者说,初始帧特征图与事件特征图的特征空间不同,为便于将初始帧特征图与事件特征图进行特征融合,可以将初始帧特征图与事件特征图转换至同一特征空间中,也即将初始帧特征图与事件特征图进行空间对齐。
其中,根据第s-1尺度的融合特征图,将第s尺度的初始帧特征图与第s尺度的事件特征图进行空间对齐,可以理解为,将初始帧特征图与事件特征图转换至融合特征图对应的特征空间中,这样得到的第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图可以在同一特征空间中进行特征融合。
在一种可能的实现方式中,可以利用本领域已知的自适应实例归一化(Adaptive Instance Normalization)思想,将不同视角表达的特征图在同一空间对齐,也即,实现根据第s-1尺度的融合特征图,将第s尺度的初始帧特征图与第s尺度的事件特征图进行空间对齐。
步骤S1313:根据第(s-1)尺度的融合特征图、第s尺度的可融合初始帧特征图以及第s尺度的可融合事件特征图,得到第s尺度的融合特征图。
在一种可能的实现方式中,根据第s-1尺度的融合特征图、第s尺度的可融合初始帧特征图以及第s尺度的可融合事件特征图,可以包括:对第s-1尺度的融合特征图进行上采样,得到上采样特征图,其中,上采样特征图与第s尺度的初始帧特征图以及第s尺度的事件特征图的尺寸相同;将上采样特征图与第s尺度的可融合初始帧特征图以及第s尺度的可融合事件特征图三者之间特征融合,得到第s尺度的融合特征图。
其中,可采用本领域已知的特征融合方式,实现上述三个特征图之间的特征融合,例如,可采用将三个特征图相加(add)、通道数不变的方式,或三个特征图在通道维度合并(concat)、通道数增加的方式,对此本公开实施例不作限制。
应理解的是,上述步骤S1312至步骤S1313可以理解为递归式的特征融合过程,其中,除第0尺度的融合特征图以外的、各个尺度的融合特征图的递归融合过程,可以表示为公式(1),
$X_s = g(X_{s-1};\, f_s,\, e_s)$        (1)

其中，$X_{s-1}$ 表示第s-1尺度的融合特征图，$f_s$ 表示第s尺度的初始帧特征图，$e_s$ 表示第s尺度的事件特征图，$g(X_{s-1}; f_s, e_s)$ 表示上述步骤S1312至步骤S1313中空间对齐以及特征融合过程。
步骤S1314:对第(S-1)尺度的融合特征图进行解码处理,得到预估待插帧。
如上文所述,可以通过解码网络对融合后特征图进行解码处理,得到预估待插帧,其中,解码网络与上述特征提取网络的网络结构对应,上述特征提取网络也可以称为编码网络。应理解的是,第S-1尺度的融合特征图,可以理解为最后一次特征融合后得到融合特征图,也即为上述融合后特征图,基于此,可以通过解码网络对第S-1尺度的特征图进行解码处理,得到预估待插帧。
在一种可能的实现方式中,可以按照上述步骤S1311至步骤S1314的实现方式,直接根据初始帧特征图与事件特征图,生成目标待插帧,也即,可以直接将预估待插帧作为目标待插帧。应理解的是,预估待插帧的图像质量已高于初始待插帧,在预估待插帧的图像质量已满足用户的画质需求时,可以直接将预估待插帧作为目标待插帧,插入至待处理视频帧中,通过该方式,可以快速得到画面清晰稳定的待处理后视频。
在本公开实施例中,能够有效实现初始帧特征图与事件特征图之间的多尺度自适应特征融合,从而有效得到预估待插帧。
如上所述,可以利用本领域已知的自适应实例归一化思想,将不同视角表达的特征图在同一空间对齐,基于此,在一种可能的实现方式中,在步骤S1312中,根据第(s-1)尺度的融合特征图,将第s尺度的初始帧特征图与第s尺度的事件特征图进行空间对齐,得到第s尺度的可融合初始帧特征图以及第s尺度的可融合事件特征图,包括:
对第(s-1)尺度的融合特征图进行上采样,得到上采样特征图,上采样特征图与第s尺度的初始帧特征图以及第s尺度的事件特征图的尺寸相同;
根据上采样特征图与第s尺度的初始帧特征图之间的第一空间转换关系,得到第s尺度的可融合初始帧特征图,其中,第一空间转换关系是根据第s尺度的初始帧特征图在空间转换时的第一像素尺寸缩放信息与第一偏置信息,以及上采样特征图的特征信息确定的;
根据上采样特征图与第s尺度的事件特征图之间的第二空间转换关系,得到第s尺度的可融合事件特征图,其中,第二空间转换关系是根据第s尺度的事件特征图在空间转换时的第二像素尺寸缩放信息与第二偏置信息,以及上采样特征图的特征信息确定的;
其中,第s尺度的可融合初始帧特征图、第s尺度的可融合事件特征图与上采样特征图处于同一特征空间中,像素尺寸缩放信息表示空间转换中每个像素点的尺寸缩放比例,偏置信息表示空间转换中每个像素点的位置偏移量。
在一种可能的实现方式中，第一空间转换关系可以表示为公式(2-1)，第二空间转换关系可以表示为公式(2-2)：

$y_f = c_f \odot \hat{X}_{s-1} + b_f$        (2-1)

$y_e = c_e \odot \hat{X}_{s-1} + b_e$        (2-2)

其中，$\hat{X}_{s-1}=\dfrac{X^{\uparrow}_{s-1}-\mu\big(X^{\uparrow}_{s-1}\big)}{\sigma\big(X^{\uparrow}_{s-1}\big)}$，$X^{\uparrow}_{s-1}$ 表示对第s-1尺度的融合特征图进行上采样得到的上采样特征图，$\mu(\cdot)$ 和 $\sigma(\cdot)$ 分别表示随机变量 $X^{\uparrow}_{s-1}$ 在空间维度上的均值和方差值，算子 $\odot$ 表示哈达玛积，$c_f$ 表示第一像素尺寸缩放信息，$b_f$ 表示第一偏置信息，$y_f$ 表示第s尺度的可融合初始帧特征图，$c_e$ 表示第二像素尺寸缩放信息，$b_e$ 表示第二偏置信息，$y_e$ 表示第s尺度的可融合事件特征图。
在一种可能的实现方式中,像素尺寸可以理解为像素级的尺寸,或者说,每个像素点在特征图中占据的尺寸,其中,尺寸缩放比例包括尺寸放大比例,或尺寸缩小比例。应理解的是,在进行空间转换时,每个像素点的像素尺寸可能增大(或者说增强),也可能缩小(或者说减弱),每个像素点的位置可能发生偏移,基于此,可以根据像素尺寸缩放比例以及位置偏移量,将不同特征空间中的特征图进行空间对齐,也即将不同特征空间中的特征图转换至同一特征空间中。
在一种可能的实现方式中，对于 $f_s$ 和 $e_s$ 两个变量，可以分别用两组独立的卷积层去学习空间转换时各自对应的 $c_f$、$b_f$ 以及 $c_e$、$b_e$，通过这种空间转换，公式(2-1)与(2-2)相当于是通过事件相机采集的两种信号（视频信号与事件信号）归纳的信息来重写 $\hat{X}_{s-1}$。
在本公开实施例中,能够有效利用第一空间转换关系与第二空间转换关系,将第s尺度的初始帧特征图与第s尺度的事件特征图进行空间对齐,得到可以进行特征融合的、第s尺度的可融合初始帧特征图以及第s尺度的可融合事件特征图。
可知晓的是,事件信号对于运动物体的边界有良好的感知能力,因为这种运动常常会造成物体上采集点的亮度变化,而且基于纯视频信号的光流运动估计算法,在这种对于运动物体的运动估计值往往是不可靠的,但是对于纹理简单的静止区域,事件相机的感知能力会减弱,其捕捉到的事件信息的可靠程度可能不如从视频信号提取的视频信息,也即事件信息与视频信息是互补的信息。
基于此,为实现自适应融合上述两种互补的信息,也即自适应融合第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图。在一种可能的实现方式中,在步骤S1313中,根据第s-1尺度的融合特征图、第s尺度的可融合初始帧特征图以及第s尺度的可融合事件特征图,得到第s尺度的融合特征图,包括:
步骤S13131:对上采样特征图进行卷积处理以及非线性处理,得到上采样特征图对应的掩码图,其中,上采样特征图是对第(s-1)尺度的融合特征图进行上采样得到的;
在一种可能的实现方式中,可以通过卷积层和激活函数(如sigmoid)层对上采样特征图进行卷积处理以及非线性处理,得到上采样特征图对应的掩码图。其中,掩码图可以表征上采样特征图中每个像素点是否为运动物体上的像素点。应理解的是,对于上述卷积层中的卷积核尺寸与数量、以及激活函数层采用的激活函数类型,本公开实施例不作限制。
在一种可能的实现方式中,掩码图可以通过二值掩码(也即0和1)的形式记录,也即例如可以用“0”表征是运动物体上的像素点,用“1”表征不是运动物体上的像素点,对此本公开实施例不作限制。
步骤S13132:根据掩码图,将第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图进行特征融合,得到第s尺度的融合特征图。
在一种可能的实现方式中,可以通过公式(3)实现根据掩码图,将第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图进行特征融合,得到第s尺度的融合特征图,
$y = y_e \odot m + y_f \odot (1-m)$        (3)

其中，m代表掩码图，1-m代表反向掩码图，$y_e$ 代表第s尺度的可融合事件特征图，$y_f$ 代表第s尺度的可融合初始帧特征图，y可以代表第s尺度的融合特征图 $X_s$。如上所述，掩码图m可以是基于二值掩码的形式记录的，反向掩码图可以表示为1-m。
图2示出根据本公开实施例的融合特征图生成流程的示意图，为便于理解本公开实施例步骤S13131至步骤S13132生成融合特征图的实现方式，结合图2示出的生成流程进行说明。如图2所示，对第s-1尺度的融合特征图 $X_{s-1}$ 进行上采样以及实例归一化（instance normalization），得到上采样特征图 $\hat{X}_{s-1}$；将上采样特征图 $\hat{X}_{s-1}$ 输入至卷积核为1×1尺寸的卷积层（1×1 Conv）和激活函数（如sigmoid）层，得到掩码图（m）与反向掩码图（1-m）；对初始帧特征图 $f_s$ 和事件特征图 $e_s$，可以分别用两组独立的卷积层去学习空间转换时各自对应的 $c_f$、$b_f$ 以及 $c_e$、$b_e$，利用上述公式(2-1)、公式(2-2)以及公式(3)，得到第s尺度的融合特征图 $X_s$。
在本公开实施例中,能够有效在上采样特征图对应的掩码图的指导下,自适应地将第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图进行特征融合。
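结合上文公式(2-1)、(2-2)与(3)，下面给出一个多尺度自适应融合模块的示意性实现（PyTorch）；其中模块名 AdaptiveFusionBlock、通道数以及用卷积层从 $f_s$、$e_s$ 学习 $c_f$、$b_f$、$c_e$、$b_e$ 的具体方式均为示例假设，仅作为一种可能实现的草图，并非图4所示网络的实际结构：

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFusionBlock(nn.Module):
    """多尺度自适应融合的示意实现，对应上文公式(2-1)、(2-2)与(3)。"""
    def __init__(self, ch: int):
        super().__init__()
        self.mask_conv = nn.Conv2d(ch, ch, kernel_size=1)      # 1×1卷积 + sigmoid 得到掩码图 m
        self.scale_f = nn.Conv2d(ch, ch, 3, padding=1)          # 从 f_s 学习 c_f（假设做法）
        self.bias_f = nn.Conv2d(ch, ch, 3, padding=1)           # 从 f_s 学习 b_f
        self.scale_e = nn.Conv2d(ch, ch, 3, padding=1)          # 从 e_s 学习 c_e
        self.bias_e = nn.Conv2d(ch, ch, 3, padding=1)           # 从 e_s 学习 b_e
        self.refine = nn.Sequential(                            # 3×3卷积 + LeakyReLU 增加非线性
            nn.Conv2d(ch, ch, 3, padding=1), nn.LeakyReLU(0.1, inplace=True))

    def forward(self, x_prev, f_s, e_s):
        # 上采样到第 s 尺度并做实例归一化，得到 \hat{X}_{s-1}
        x_up = F.interpolate(x_prev, size=f_s.shape[-2:], mode='bilinear', align_corners=False)
        x_hat = F.instance_norm(x_up)
        m = torch.sigmoid(self.mask_conv(x_hat))                # 掩码图 m，反向掩码为 1-m
        y_f = self.scale_f(f_s) * x_hat + self.bias_f(f_s)      # 公式(2-1)
        y_e = self.scale_e(e_s) * x_hat + self.bias_e(e_s)      # 公式(2-2)
        y = y_e * m + y_f * (1 - m)                             # 公式(3)：初始融合特征图
        return self.refine(y)                                   # 第 s 尺度的融合特征图 X_s

# 用法示意
g = AdaptiveFusionBlock(ch=64)
x_prev = torch.randn(1, 64, 16, 16)                             # 第 s-1 尺度的融合特征图
f_s, e_s = torch.randn(1, 64, 32, 32), torch.randn(1, 64, 32, 32)
x_s = g(x_prev, f_s, e_s)
```

上述 forward 的调用方式即对应公式(1)中 $X_s=g(X_{s-1};f_s,e_s)$ 的递归融合；对最后一个尺度的输出再经解码网络即可得到预估待插帧。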
考虑到,仅通过上述公式(3)生成各个尺度的融合特征图的过程,实际上是线性的仿射变换过程,为增加融合特征图的非线性或者说为增加融合特征图的复杂度,在一种可能的实现方式中,在步骤S13132中,根据掩码图,将第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图进行特征融合,得到第s尺度的融合特征图,包括:
根据掩码图,将第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图进行特征融合,得到第s尺度的初始融合特征图;对第s尺度的初始融合特征图进行卷积处理以及非线性处理,得到第s尺度的融合特征图。通过该方式,能够有效在上采样特征图对应的掩码图的指导下,自适应地将第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图进行特征融合。
其中,可以参照公式(3)示出的实现方式,实现根据掩码图,将第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图进行特征融合,得到第s尺度的初始融合特征图,也即上述公式(3)中的y也可以代表第s尺度的初始融合特征图。
基于上述公式(3),在一种可能的实现方式中,根据掩码图,将第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图进行特征融合,得到第s尺度的初始融合特征图,可以包括:计算掩码图与第s尺度的可融合事件特征图之间的哈达玛积;根据掩码图对应的反向掩码图,计算反向掩码图与第s尺度的可融合初始帧特征图之间的乘积;将哈达玛积与乘积相加,得到第s尺度的初始融合特征图。通过该方式,能够有效增加融合特征图的非线性或者说为增加融合特征图的复杂度,便于实现多尺度的特征融合。能够根据在掩码图与反向掩码图的指导下,自适应地将第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图进行特征融合。
在一种可能的实现方式中,例如可以通过卷积核为3x3尺寸的卷积层和激活函数(如LeakyRelu)层,对第s尺度的初始融合特征图进行卷积处理以及非线性处理,得到第s尺度的融合特征图。应理解的是,对于上述卷积层中的卷积核尺寸与数量、以及激活函数层采用的激活函数类型,本公开实施例不作限制。
在本公开实施例中,能够有效增加融合特征图的非线性或者说为增加融合特征图的复杂度,便于实现多尺度的特征融合。
如上所述,可以利用原始视频帧中物体的图像细节结合原始视频帧中物体的运动轨迹,将物体的细节信息融合至预估待插帧中,从而增强预估待插帧的图像质量。在一种可能的实现方式中,在步骤S132中,根据待处理视频中、与初始待插帧的插帧时刻相邻的原始视频帧,以及原始视频帧对应的第二事件信息,对预估待插帧进行优化,得到目标待插帧,包括:
步骤S1321:将预估待插帧与第一事件信息进行组合,得到预估帧事件组合信息。
如上所述,第一事件信息可以表征初始待插帧对应的插帧时刻处物体的运动轨迹,预估待插帧是根据初始待插帧的初始帧特征图与第一事件信息的事件特征图生成,第一事件信息可以采用“图”的形式记录初始待插帧对应的插帧时刻处事件信号的累加值。应理解的是,预估帧事件组合信息中包括预估待插帧与第一事件信息。
步骤S1322:将原始视频帧与第二事件信息进行组合,得到原始帧事件组合信息。
如上所述,第二事件信息可以表征原始视频帧对应的采集时刻处物体的运动轨迹,第二事件信息可以采用“图”的形式记录原始视频帧对应的采集时刻处事件信号的累加值。应理解的是,原始帧事件组合信息中包括预估待插帧与第二事件信息。
步骤S1323:分别对预估帧事件组合信息与原始帧事件组合信息进行特征提取,得到预估帧事 件组合信息对应的预估帧事件特征图以及原始帧事件组合信息对应的原始帧事件特征图。
在一种可能的实现方式中,例如可以采用参数共享的多层卷积层,分别对预估帧事件组合信息与原始帧事件组合信息进行特征提取,得到预估帧事件组合信息对应的预估帧事件特征图以及原始帧事件组合信息对应的原始帧事件特征图。
举例来说,可以将预估帧事件组合信息,输入至3层卷积层中,输出预估帧事件特征图;将原始帧事件组合信息,输入至该3层卷积层中,输出原始帧事件特征图。其中,考虑到原始视频帧可以是至少一帧,原始帧事件组合信息可以是至少一个,那么原始帧事件特征图可以是至少一个。应理解的是,可以采用本领域已知的特征提取方式,提取上述预估帧事件特征图以及原始帧事件特征图,对此本公开实施例不作限制。
步骤S1324:根据预估帧事件特征图,对原始帧事件特征图进行调整,得到整合特征图。
在一种可能的实现方式中,可以利用注意力机制,从原始帧事件特征图中找到与预估帧事件特征图中的每个像素点相匹配的匹配像素点,或者说,从原始帧事件特征图中找到与预估帧事件特征图中的每个像素点相似度最大的匹配像素点;进而以原始帧事件特征图中每个匹配像素点的像素位置为中心,从原始帧事件特征图上裁切出多个指定尺寸的特征图块,根据每个匹配像素点的像素位置,对多个指定尺寸的特征图块进行尺寸拼接,得到整合特征图。
其中,尺寸拼接,可以理解为,在特征图的长宽维度上进行拼接,使整合特征图的尺寸与原始帧事件特征图的尺寸相同。例如,4个2×2尺寸的特征图块进行尺寸拼接,可以得到一个4×4尺寸的整合特征图。
步骤S1325:根据整合特征图、预估帧事件特征图以及融合特征图,对预估待插帧进行优化,得到目标待插帧,融合特征图是对初始帧特征图与事件特征图进行多尺度融合得到的。
其中,融合特征图可以是通过上述本公开实施例中步骤S1311至步骤S1313对初始帧特征图与事件特征图进行多尺度融合得到的,对于融合特征图的确定过程,在此不做赘述。以及,如上所述,融合特征图可以是多尺度的,整合特征图也可以是多尺度的。
如上所述,可以采用多层卷积层,分别对预估帧事件组合信息与原始帧事件组合信息进行特征提取,那么预估帧事件特征图与原始帧事件特征图可以是多尺度的特征图,基于此,整合特征图可以是多尺度的。
在一种可能的实现方式中,根据整合特征图、预估帧事件特征图以及融合特征图,对预估待插帧进行优化,得到目标待插帧,可以包括:对整合特征图、预估帧事件特征图以及融合特征图进行多尺度融合,得到目标融合特征图;通过残差网络提取目标融合特征图中的残差特征,并通过指定解码网络对残差特征进行解码处理,得到残差特征对应的残差信息;将残差信息叠加至预估待插帧中,得到目标待插帧。
其中,可以参照上述本公开实施例中步骤S1311至步骤S1313的方式,实现对整合特征图、预估帧事件特征图以及融合特征图进行多尺度融合,得到目标融合特征图,在此不做赘述。
其中,指定解码网络的网络结构可以与上述提取原始帧事件特征图以及预估帧事件特征图所用的多层卷积层对应,也即上述多层卷积层可理解为编码网络。在一种可能的实现方式中,残差信息也可以采用“图”的形式,将参数信息叠加至预估待插帧中,可以理解为,将残差信息与预估待插帧进行图像融合。
在本公开实施例中,能够将整合特征图、预估帧事件特征图与融合特征图进行融合,并提取目标融合特征图中表征图像细节的残差信息,进而将预估待插帧与残差信息进行叠加所得到的目标待插帧的图像质量更高。
如上所述,可以采用多层卷积层,分别对预估帧事件组合信息与原始帧事件组合信息进行特征提取,那么预估帧事件特征图与原始帧事件特征图可以是多尺度的特征图。
在一种可能的实现方式中,预估帧事件特征图包括S *个尺度,原始帧事件特征图包括S *个尺度,1≤S *≤S,S *为正整数,s *∈[S-S *,S),第(S-S *)尺度的预估帧事件特征图的尺寸为I×I,I为正整数,其中,在步骤S1324中,根据预估帧事件特征图,对原始帧事件特征图进行调整,得到整合特征图,包括:
步骤S13241:针对第(S-S *)尺度的预估帧事件特征图中的任一个第一像素点,从第(S-S *)尺度的原始帧事件特征图中确定出与第一像素点匹配的第一匹配像素点。
其中,与第一像素点匹配的第一匹配像素点,可以理解为,与第一像素点相似度最大的第一匹配特征图像。在一种可能的实现方式中,针对第S-S *尺度的预估帧事件特征图中的任一个第一像素点,从第S-S *尺度的原始帧事件特征图中确定出与第一像素点匹配的第一匹配像素点,包括:
针对任一个第一像素点,计算第一像素点分别与第S-S *尺度的原始帧事件特征图中、在指定窗口内的各个像素点之间的特征相似度,指定窗口是根据第一像素点的像素位置确定的;将指定窗口内的各个像素点中的、最大特征相似度所对应的像素点,确定为第一匹配像素点。通过该方式,可以高效地确定出与各第一像素点匹配的第一匹配像素点。
在一种可能的实现方式中,指定窗口例如可以是每个第一像素点的像素位置为中心、周围的(2m+1) 2大小的局部窗口,m可以根据实际需求设置,例如可以是设置为3,对此本公开实施例不作限制。通过该方式,能够缩小在原始帧事件特征图中检索第一匹配像素点的范围,减少运算量,进行提高确定第一匹配像素点的效率。
在一种可能的实现方式中,例如可以采用欧式距离(又称欧几里得距离)、余弦距离等方式计算像素点之间的特征相似度;将指定窗口内的各个像素点中的、最大特征相似度所对应的像素点,确定为第一匹配像素点,可以理解为,第一匹配像素点是指定窗口内的各个像素点中、欧式距离或余弦距离最小的像素点。
在一种可能的实现方式中，公式(4)示出根据本公开实施例的一种采用欧式距离确定特征相似度的实现方式。

$D(i,p)=\big\|q(i)-k_0(i+p)\big\|_2$        (4)

其中，i代表第S-S*尺度的预估帧事件特征图中的任一个第一像素点的像素位置，p代表指定窗口内给定的整数偏移量，$p\in[-m,m]^2$，i+p代表原始帧事件特征图中指定窗口内各个像素点的像素位置，$q(i)$ 代表预估帧事件特征图上第一像素点对应的特征值（即"查询"向量），$k_0(i+p)$ 代表原始帧事件特征图中指定窗口内的各个像素点的特征值，$\|\cdot\|_2$ 表示2-范数，D(i,p)代表第一像素点分别与指定窗口内的各个像素点之间的欧氏距离。

应理解的是，对于预估帧事件特征图上的每个第一像素点，通过公式(4)均可以得到每个第一像素点与指定窗口内各个像素点之间的欧式距离，其中，距离越小，代表特征相似度越高。基于此，在原始帧事件特征图的指定窗口内可以找到距离最小的像素位置，也即最匹配的像素点的像素位置j，即 $j=i+p^*$，其中 $p^*=\arg\min_p D(i,p)$，$p^*$ 可理解为使D(i,p)最小的p；或者说，行向量元素 $\{D(i,p)\,|\,p\in[-m,m]^2\}$ 可以被组织成 $(2m+1)^2$ 个"查询"向量和"键"向量之间的距离，其中 $j=i+p^*$ 是最小距离所在的像素位置，$k_0(j)$ 可以理解为与第一像素点 $q(i)$ 相匹配的第一匹配像素点。
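下面给出公式(4)所述窗口内匹配的示意性代码草图（NumPy）；其中函数名、窗口大小 m=3 以及随机特征数据均为示例假设：

```python
import numpy as np

def match_in_window(query_feat, key_feat, i_row, i_col, m=3):
    """公式(4)的示意实现：在 key_feat（原始帧事件特征图）中、以 (i_row, i_col) 为中心的
    (2m+1)^2 窗口内，找到与 query_feat（预估帧事件特征图）上该第一像素点欧氏距离最小的位置。
    两个特征图形状均为 (C, H, W)。"""
    C, H, W = key_feat.shape
    q = query_feat[:, i_row, i_col]                        # "查询"向量 q(i)
    best_d, best_pos = np.inf, (i_row, i_col)
    for dr in range(-m, m + 1):
        for dc in range(-m, m + 1):
            r, c = i_row + dr, i_col + dc
            if 0 <= r < H and 0 <= c < W:
                d = np.linalg.norm(q - key_feat[:, r, c])  # D(i, p)
                if d < best_d:
                    best_d, best_pos = d, (r, c)
    return best_pos, best_d                                 # 第一匹配像素点位置 j 及其距离

query = np.random.rand(32, 16, 16).astype(np.float32)
key = np.random.rand(32, 16, 16).astype(np.float32)
j, dist = match_in_window(query, key, i_row=8, i_col=8, m=3)
```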
步骤S13242:根据第一匹配像素点的像素位置以及指定偏移量,确定与像素位置对应的亚像素位置,指定偏移量为小数。
在一种可能的实现方式中,可以以第一匹配像素点的像素位置j为中心,构建一个局部距离场,该局部距离场可以被一个参数化的二阶多项式进行连续拟合,而这个二阶多项式的全局极小值是有闭合解的,通过将二阶多项式的连续拟合融入到神经网络训练过程中,可以调整局部距离场的形状,也即调整二阶多项式的参数,从而得到估计的指定偏移量。考虑到行文简洁,本公开实施例将在下文详细阐述该指定偏移量的确定方式。
其中,根据第一匹配像素点的像素位置以及指定偏移量,确定与像素位置对应的亚像素位置,可以包括:将像素位置与指定偏移量相加,得到亚像素位置,其中,由于指定偏移量为小数,从而可以得到精度更高的、非整数位置上的亚像素位置。
步骤S13243:根据I×I个亚像素位置,对第s *尺度的原始帧事件特征图进行调整,得到第s *尺度的整合特征图。
如上所述,第S-S *尺度的预估帧事件特征图的尺寸为I×I,也即,第S-S *尺度的预估帧事件特征图上有I×I个第一像素点,针对每个第一像素点均可以按照上述步骤S13241至步骤S13242得到亚像素位置,也即可以得到I×I个亚像素位置。
可理解的是,第s *尺度的原始帧事件特征图的尺寸是第S-S *尺度的预估帧事件特征图的n倍,I×I个亚像素位置是基于第S-S *尺度的预估帧事件特征图,也即,是基于最小尺度的预估帧事件特征图确定的,若要根据I×I个亚像素位置,对第s *尺度的原始帧事件特征图进行调整,可以是根据I×I个亚像素位置,对第s *尺度的原始帧事件特征图进行裁切,得到I×I个、n×n尺寸的特征图块,并对I×I个、n×n尺寸的特征图块进行尺寸拼接,得到第s *尺度的整合特征图。
在一种可能的实现方式中,在步骤S13243中,根据I×I个亚像素位置,对第s *尺度的原始帧事件特征图进行调整,得到第s *尺度的整合特征图,包括:
以每一个亚像素位置为中心,从第s *尺度的原始帧事件特征图上裁切出I×I个、n×n尺寸的特 征图块;根据I×I个亚像素位置,对I×I个、n×n尺寸的特征图块进行尺寸拼接,得到第s *尺度的整合特征图,第s *尺度的整合特征图与第s *尺度的原始帧事件特征图的尺寸相同。通过该方式,可以使第s *尺度的整合特征图中包含了关注度更高的特征信息。
考虑到,各个特征图块上的各个位置均为非整数的坐标位置,在一种可能的实现方式中,可以通过线性插值(例如双线性插值)的方式,得到各个特征图块上各个位置处的特征值。
举例来说,图3示出根据本公开实施例的原始帧事件特征图的示意图,如图3所示,j代表一个亚像素位置,假设n为2,也即,针对亚像素位置j裁切出2×2尺寸的特征图块H j,例如针对特征图块H j上亚像素位置h1的特征值,可以对该亚像素位置h1周围的两个像素位置“a6、a7”上的特征值(或四个像素位置“a1、a2、a6、a7”上的特征值)进行双线性插值,得到该亚像素位置h1处对应的特征值,其中,对于其它h2、h3与h4处的特征值,均可以对各自周围的像素位置上的特征值进行双线性插值得到各自对应的特征值。
应理解的是,针对每个特征图块,均可以利用各特征图块上每个位置处周围的至少两个像素位置上的特征值,对至少两个像素位置上的特征值进行双线性插值得到各特征图块上每个位置处的特征值。
其中,根据I×I个亚像素位置,对I×I个、n×n尺寸的特征图块进行尺寸拼接,可以理解为,根据I×I个亚像素位置在尺寸维度(也即长宽维度)上拼接I×I个、n×n尺寸的特征图块,使第s *尺度的整合特征图的尺寸与第s *尺度的原始帧事件特征图相同。
在本公开实施例中,相当于利用注意力机制找到与每个第一像素点对应的亚像素位置,并基于亚像素位置得到整合特征图,也即整合特征图是结合注意力机制的特征图,从而使得整合特征图中包含了关注度更高的特征信息。
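以下代码草图示意"以亚像素位置为中心、通过双线性插值裁切 n×n 特征图块并按位置进行尺寸拼接"的过程（NumPy）；函数名与采样偏移的具体取法均为示例假设，仅作说明：

```python
import numpy as np

def bilinear_sample(feat, y, x):
    """在特征图 feat (C, H, W) 的非整数位置 (y, x) 处做双线性插值。"""
    C, H, W = feat.shape
    y = float(np.clip(y, 0, H - 1))
    x = float(np.clip(x, 0, W - 1))
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, H - 1), min(x0 + 1, W - 1)
    wy, wx = y - y0, x - x0
    return ((1 - wy) * (1 - wx) * feat[:, y0, x0] + (1 - wy) * wx * feat[:, y0, x1]
            + wy * (1 - wx) * feat[:, y1, x0] + wy * wx * feat[:, y1, x1])

def crop_and_splice(key_feat, subpixel_pos, n):
    """以每个亚像素位置为中心裁出 n×n 特征图块，并按位置进行尺寸拼接得到整合特征图。
    key_feat: (C, H, W) 的原始帧事件特征图；subpixel_pos: (I, I, 2) 的亚像素坐标 (y, x)。"""
    C = key_feat.shape[0]
    I = subpixel_pos.shape[0]
    out = np.zeros((C, I * n, I * n), dtype=key_feat.dtype)
    offsets = np.arange(n) - (n - 1) / 2.0       # 以亚像素位置为中心的对称采样偏移（示例取法）
    for a in range(I):
        for b in range(I):
            cy, cx = subpixel_pos[a, b]
            patch = np.array([[bilinear_sample(key_feat, cy + dy, cx + dx) for dx in offsets]
                              for dy in offsets])            # (n, n, C)
            out[:, a * n:(a + 1) * n, b * n:(b + 1) * n] = patch.transpose(2, 0, 1)
    return out

subpix = np.random.rand(4, 4, 2) * 15.0           # I×I 个亚像素位置（示例数据）
integrated = crop_and_splice(np.random.rand(32, 64, 64), subpix, n=4)   # 第 s* 尺度的整合特征图
```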
如上所述,可以以第一匹配像素点的像素位置j为中心,构建一个局部距离场,该局部距离场可以被一个参数化的二阶多项式进行连续拟合,而这个二阶多项式的全局极小值是有闭合解的,通过将二阶多项式的连续拟合融入到神经网络训练过程中,可以调整局部距离场的形状,也即调整二阶多项式的参数,从而得到估计的指定偏移量。
在一种可能的实现方式中,在步骤S13242中,根据述第一匹配像素点的像素位置以及指定偏移量,确定与像素位置对应的亚像素位置,包括:
根据像素位置、预设的偏移参数以及预设的曲面参数,确定目标函数;其中,目标函数是根据曲面函数与距离函数之间的差异构建的,距离函数是根据像素位置与偏移参数构建的,曲面函数是根据曲面参数与偏移参数构建的。根据偏移参数对应的预设取值区间,对目标函数进行最小化求解,得到曲面参数的参数值,其中偏移参数为目标函数中的自变量;根据曲面参数的参数值,确定指定偏移量;将像素位置与指定偏移量相加,得到亚像素位置。通过该方式,可以准确有效地确定出亚像素位置。
在一种可能的实现方式中，距离函数d(u)可以表示为公式(5)，也即上述局部距离场，曲面函数 $\hat{d}(u)$ 可以表示为公式(6)，也即上述二阶多项式，目标函数可以表示为公式(7)。

$d(u)=D(i,\,p^*+u),\quad u\in[-n,n]^2$        (5)

其中，D(·)代表欧式距离，可参照上述公式(4)，u代表偏移参数，$[-n,n]^2$ 代表预设取值区间，n的值可以根据实际需求设置，例如可以设置为1，对此本公开实施例不作限制。在一种可能的实现方式中，预设取值区间可以是以亚像素位置j为中心采样一个大小为 $(2n+1)^2$ 的局部窗口，也即得到该预设取值区间 $[-n,n]^2$，或者说，作为自变量的偏移参数从该 $(2n+1)^2$ 的局部窗口内取值来求解目标函数。

$\hat{d}(u)=u^{\mathrm T}A\,u+b^{\mathrm T}u+c$        (6)

其中，A、b和c代表曲面参数。在一种可能的实现方式中，A可以是一个2×2的正定矩阵，b是一个2×1的向量，而c是一个偏置常数，$u^{\mathrm T}$ 代表u的转置，$b^{\mathrm T}$ 代表b的转置。应理解的是，由于通常用横坐标与纵坐标表征图像上像素点的位置，偏移参数可以是2×1的向量，也即偏移参数可以包括横轴上的偏移参数与纵轴上的偏移参数。
应理解的是，上述公式(5)与(6)中的各个约束条件，可以是使得公式(6)为一个具有全局极小值点的二次曲面函数。为了估计未知曲面参数A、b和c的参数值，可以采用加权最小二乘法，根据 $(2n+1)^2$ 个已知的自变量u和其对应的距离函数值d(u)，通过最小化目标函数(7)的方式，求解得到曲面参数的参数值。

$\{A,b,c\}=\arg\min_{A,b,c}\ \sum_{u\in[-n,n]^2} w(u)\,\big\|\hat{d}(u)-d(u)\big\|^2$        (7)

其中，w(u)代表高斯分布函数（例如可取 $w(u)=\exp\!\big(-\|u\|^2/(2\sigma^2)\big)$ 的形式），σ为常数参数，exp代表以自然常数e为底的指数函数，$\|\hat{d}(u)-d(u)\|^2$ 代表曲面函数与距离函数之间的差异，$\|\cdot\|^2$ 代表范数的平方。上述公式(7)可理解为找曲面函数 $\hat{d}(u)$ 与距离函数d(u)之间的差异最小的情况下的A、b、c。应理解的是，w(u)也可以采用其它权重分布函数代替，例如可以采用欧氏距离、余弦距离等，对此本公开实施例不作限制。
其中,w(u)可以理解为一个常数矩阵,可理解的是,在目标函数的求解过程中,对于每一个自变量u都是可导的,二阶多项式(也即二次曲面)拟合过程可以作为一个可导的层,嵌入到神经网络训练中。
在一种可能的实现方式中,为使估计出来的A是正定矩阵,可以设置A中非对角线的元素全为0,只优化对角线上的元素,以及,若对角线上的元素出现负数,可以用函数max(0,·)将负数的元素改为0,通过该方式,可以减少运算量并快速得到矩阵A中的元素值。其中,考虑到忽略非对角线元素会使得估计出来的二次曲面是各向同性的,但由于可以将这个拟合过程嵌入到神经网络训练过程中,公式(5)所示的局部距离场(即距离函数)是可以通过反向传播来修正的,从而有效弥补局部距离场表达的局限性。
在一种可能的实现方式中,曲面参数包括第一参数(如上述A)与第二参数(如上述b),第一参数为2×2的矩阵,第二参数为2×1的向量第一参数的参数值包括矩阵中对角线上的两个第一元素值,第二参数的参数值包括向量中的两个第二元素值,也即,曲面参数的参数值包括两个第一元素值以及两个第二元素值。其中,根据曲面参数的参数值,确定指定偏移量,包括:根据两个第一元素值与两个第二元素值,确定纵轴偏移量与横轴偏移量,指定偏移量包括纵轴偏移量与横轴偏移量。通过该方式,可以有效得到横轴偏移量与纵轴偏移量。
如上所述,通常用横坐标与纵坐标表征图像上的位置,在一种可能的实现方式中,可以通过公式(8)实现根据两个第一元素值与两个第二元素值,确定纵轴偏移量与横轴偏移量。
$u^*=\left(\dfrac{-b^{(0)}}{2A^{(0,0)}+\epsilon},\ \dfrac{-b^{(1)}}{2A^{(1,1)}+\epsilon}\right)$        (8)

其中，$u^*$ 代表指定偏移量，$A^{(0,0)}$ 和 $A^{(1,1)}$ 分别代表矩阵中对角线上的两个第一元素值，$A^{(0,0)}$ 可以代表矩阵对角线上的左上元素值，$A^{(1,1)}$ 可以代表矩阵对角线的右下元素值，$b^{(0)}$ 和 $b^{(1)}$ 可以代表向量中的两个第二元素值，$b^{(0)}$ 代表向量中第一个元素值，$b^{(1)}$ 代表向量中第二个元素值，ε为一个极小的常数来保证除法数值稳定，也即使分母不为0，$\dfrac{-b^{(0)}}{2A^{(0,0)}+\epsilon}$ 代表横轴偏移量，$\dfrac{-b^{(1)}}{2A^{(1,1)}+\epsilon}$ 代表纵轴偏移量，亚像素位置可以表示为 $j^*=j+u^*=i+p^*+u^*$。
在本公开实施例中,能够准确有效地确定出亚像素位置,便于之后基于亚像素位置得到整合特征图。
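结合公式(5)至公式(8)，下面给出一个二次曲面加权最小二乘拟合并求取指定偏移量的示意性草图（NumPy）；其中忽略矩阵A的非对角元素、高斯权重的具体形式以及 σ、ε 的取值均为示例假设：

```python
import numpy as np

def subpixel_offset(dist_field, n=1, sigma=1.0, eps=1e-6):
    """公式(5)~(8)的示意实现：dist_field 为以第一匹配像素点 j 为中心的 (2n+1)×(2n+1)
    局部距离场 d(u)，返回估计的指定偏移量 u* = (横轴偏移, 纵轴偏移)。
    按"忽略 A 的非对角元素"的简化做加权最小二乘拟合。"""
    us = np.arange(-n, n + 1, dtype=np.float64)
    uy, ux = np.meshgrid(us, us, indexing='ij')              # uy: 纵轴偏移量, ux: 横轴偏移量
    d = dist_field.reshape(-1)
    w = np.exp(-(ux ** 2 + uy ** 2) / (2 * sigma ** 2)).reshape(-1)   # 高斯权重 w(u)（示例形式）
    # 设计矩阵列对应 [A(0,0), A(1,1), b(0), b(1), c]，拟合 d̂(u)=A00*ux²+A11*uy²+b0*ux+b1*uy+c
    X = np.stack([ux.reshape(-1) ** 2, uy.reshape(-1) ** 2,
                  ux.reshape(-1), uy.reshape(-1), np.ones_like(d)], axis=1)
    sw = np.sqrt(w)[:, None]
    params, *_ = np.linalg.lstsq(X * sw, d[:, None] * sw, rcond=None)
    a00, a11, b0, b1, _ = params.reshape(-1)
    a00, a11 = max(a00, 0.0), max(a11, 0.0)                  # 对角元素出现负数时置0（正定约束的简化）
    u_star = np.array([-b0 / (2 * a00 + eps), -b1 / (2 * a11 + eps)])   # 公式(8)
    return u_star                                            # 亚像素位置为 j* = j + u*

# 示例：3×3 局部距离场，极小值略偏离窗口中心，返回的偏移量为小数
d_local = np.array([[2.0, 1.5, 1.6], [1.4, 1.0, 1.1], [1.5, 1.1, 1.3]])
print(subpixel_offset(d_local, n=1))
```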
可知晓的是,初始待插帧通常是基于与该初始待插帧时序相邻的前后两帧原始视频帧确定的,也即,原始视频帧可以包括至少两帧,通过上述本公开实施例步骤S13241至步骤S13243得到的第s *尺度的整合特征图包括至少两个,在一种可能的实现方式中,在步骤S1325中,根据整合特征图、预估帧事件特征图以及融合特征图,对预估待插帧进行优化,得到目标待插帧,包括:
步骤S13251:根据第s *尺度的预估帧事件特征图以及至少两个第s *尺度的整合特征图,确定第s *尺度的目标整合特征图。
其中,可以参照上述本公开实施例步骤S13241至步骤S13243得到各个第s *尺度的整合特征图,在此不做赘述。
在一种可能的实现方式中,可以计算第s *尺度的预估帧事件特征图分别与各个第s *尺度的整合特征图之间的相似度,并将相似度最大的第s *尺度的整合特征图,确定为第s *尺度的目标整合特征图。其中,例如可以采用两个特征图之间的欧式距离或余弦距离,表征该两个特征图之间的相似度。
在一种可能的实现方式中,将相似度最大的第s *尺度的整合特征图作为第s *尺度的目标整合特征图,也即,从至少两个第s *尺度的整合特征图中,选取与第s *尺度的预估帧事件特征图最相似的整合特征图,作为第s *尺度的目标整合特征图。通过该方式,可以快速确定出各个尺度的预估帧事件特征图更接近的目标整合特征图。
步骤S13252:根据S *个尺度的目标整合特征图、预估帧事件特征图以及融合特征图,对预估待 插帧进行优化,得到目标待插帧。
如上所述,预估帧事件特征图可以是多尺度的,融合特征图可以是通过上述本公开实施例中步骤S1311至步骤S1313对初始帧特征图与事件特征图进行多尺度融合得到的,也即融合特征图可以是多尺度的。应理解的是,同一尺度的目标整合特征图、预估帧事件特征图以及融合特征图三者之间的尺寸相同。
在一种可能的实现方式中,根据S *个尺度的目标整合特征图、预估帧事件特征图以及融合特征图,对预估待插帧进行优化,得到目标待插帧,包括:
步骤S132521:根据第(S-S *)尺度的目标整合特征图、第(S-S *)尺度的预估帧事件特征图以及第(S-S *)尺度的融合特征图,得到第(S-S *)尺度的目标融合特征图。
在一种可能的实现方式中,根据第(S-S *)尺度的目标整合特征图、第(S-S *)尺度的预估帧事件特征图以及第(S-S *)尺度的融合特征图,得到第(S-S *)尺度的目标融合特征图,包括:
提取第(S-S *)尺度的预估帧事件特征图的残差特征,得到第(S-S *)尺度的残差特征图;将第(S-S *)尺度的残差特征图、第(S-S *)尺度的目标整合特征图以及第(S-S *)尺度的融合特征图进行通道拼接,得到目标拼接特征图;对目标拼接特征图进行滤波处理,得到第S-S *尺度的目标融合特征图。通过该方式,可以有效得到第S-S *尺度的目标融合特征图。
其中,可以通过残差网络,提取第S-S *尺度的预估帧事件特征图的残差特征,得到第S-S *尺度的残差特征图,对于残差网络的网络本公开实施例不作限制。可参照上述本公开实施例中得到拼接特征图的方式,实现将第S-S *尺度的残差特征图、第S-S *尺度的目标整合特征图以及第S-S *尺度的融合特征图进行通道拼接,得到目标拼接特征图,在此不做赘述。
在一种可能的实现方式中,例如可以通过卷积核为1×1尺寸的卷积层,对目标拼接特征图进行滤波处理,得到第S-S *尺度的融合特征图,其中,卷积层中卷积核的数量与第S-S *尺度的目标整合特征图的通道数相同。应理解的是,第S-S *尺度的目标融合特征图,也即为最小尺度的目标融合特征图,第S-S *尺度的目标融合特征图的尺寸以及通道数,与第S-S *尺度的目标整合特征图相同。
步骤S132522:对第(s *-1)尺度的目标融合特征图、第s *尺度的目标整合特征图以及第s *尺度的目标整合特征图进行特征融合,得到第s *尺度的目标融合特征图。
其中,可以参照上述本公开实施例步骤S1313中生成第s尺度的融合特征图的实现方式,实现对第s *-1尺度的目标融合特征图、第s *尺度的目标整合特征图以及第s *尺度的融合特征图进行特征融合,得到第s *尺度的目标融合特征图。
也即,可以对第s *-1尺度的目标融合特征图进行上采样,得到目标上采样特征图;对目标上采样特征图进行卷积处理以及非线性处理,得到上采样特征图对应的目标掩码图;根据目标掩码图,将第s *尺度的目标整合特征图以及第s *尺度的融合特征图进行特征融合,得到第s *尺度的目标融合特征图。
步骤S132523:提取第s *尺度的目标融合特征图中的残差特征,得到第s *尺度的残差特征图。
在一种可能的实现方式中,可以通过残差网络提取第s *尺度的目标融合特征图中的残差特征,得到第s *尺度的残差特征图。应理解的是,对于残差网络的网络结构,本公开实施例不作限制。
步骤S132524:对第S尺度的残差特征图进行解码处理,得到解码后的残差信息。
在一种可能的实现方式中,可以通过指定解码网络对第S尺度的残差特征进行解码处理,得到解码后的残差信息。应理解的是,指定解码网络的网络结构可以与上述提取原始帧事件特征图以及预估帧事件特征图所用的多层卷积层对应,也即上述多层卷积层可理解为编码网络。对于残差网络与指定解码网络的网络结构,本公开实施例不作限制。
通过该方式,能够提取目标融合特征图中表征图像细节的残差信息,进而将预估待插帧与残差信息进行叠加所得到的目标待插帧的图像质量更高。
步骤S132525:将残差信息叠加至预估待插帧中,得到目标待插帧。
如上所述,残差信息是从残差特征图中提取得到,残差信息也可以是采用“图”的形式,基于此,将残差信息叠加至预估待插帧中,可以理解为,将残差信息与预估待插帧进行图像融合。其中,可以采用本领域已知的图像融合技术,例如对同一位置处的像素值进行加权平均、或对像素值进行叠加等方式,对此本公开实施例不作限制。
在本公开实施例中,能够将与预估帧事件特征图相似度更高的目标整合特征图、预估帧事件特征图与融合特征图三者进行融合,并提取目标融合特征图中表征图像细节的残差信息,进而将预估待插帧与残差信息进行叠加所得到的目标待插帧的图像质量更高。
考虑到,任意待插帧中的每一个像素点,通常都能在该待插帧前后两个相邻的原始视频帧中找 到最匹配的像素点,换句话说,任意待插帧中部分像素点可能是与在先相邻的原始视频帧中在同一位置处的像素点最匹配,而部分像素点可能是与在后相邻的原始视频帧中在同一位置处的像素点最匹配。
在一种可能的实现方式中,在步骤S13251中,根据第s *尺度的预估帧事件特征图,分别与至少两个第s *尺度的整合特征图之间的特征相似度,确定第s *尺度的目标整合特征图,包括:
针对第s *尺度的预估帧事件特征图中的任一个第二像素点,从至少两个第s *尺度的整合特征图中,确定出与第二像素点匹配的目标匹配像素点;根据各个与第二像素点匹配的目标匹配像素点处的特征信息,生成第s *尺度的目标整合特征图。通过该方式,能够在第s *尺度的整合特征图包括至少两个的情况下,确定出与各个第二像素点匹配的目标匹配像素点,从而得到与第s *尺度的预估帧事件特征图最匹配的第s *尺度的目标整合特征图。
在一种可能的实现方式中,特征信息包括各个目标匹配像素点处的特征值,根据各个与第二像素点匹配的目标匹配像素点处的特征信息,生成第s *尺度的目标整合特征图,可以包括:根据第s *尺度的预估帧事件特征图中的每个第二像素点的像素位置,对各个目标匹配像素点处的特征值按像素位置进行排列,生成第s *尺度的目标整合特征图;或者说,根据每个第二像素点的像素位置,对与第s *尺度的整合特征图的尺寸相同的空白特征图,添加各个目标匹配像素点处的特征值,生成第s *尺度的目标整合特征图。
在一种可能的实现方式中,针对第s *尺度的预估帧事件特征图中的任一个第二像素点,从至少两个第s *尺度的整合特征图中,确定出与第二像素点匹配的目标匹配像素点,包括:
针对任一个第s *尺度的整合特征图,根据第二像素点与第s *尺度的整合特征图中各个像素点之间的特征相似度,从第s *尺度的整合特征图中确定出与第二像素点匹配的第二匹配像素点;
根据至少两个第二匹配像素点各自对应的特征相似度,将至少两个第二匹配像素点中特征相似度最大的第二匹配像素点,确定为与所述第二像素点匹配的目标匹配像素点。
在一种可能的实现方式中,可以参照上述本公开实施例步骤S13241的实现方式,实现根据第二像素点与第s *尺度的整合特征图中各个像素点之间的特征相似度,从第s *尺度的整合特征图中确定出与第二像素点匹配的第二匹配像素点,在此不做赘述。
考虑到,为了提高确定第二匹配像素点的效率,在一种可能的实现方式中,根据第二像素点与第s *尺度的整合特征图中各个像素点之间的特征相似度,从第s *尺度的整合特征图中确定出与第二像素点匹配的第二匹配像素点,可以包括:根据第二像素点与第s *尺度的整合特征图中、在指定窗口内的各个像素点之间的特征相似度,从第s *尺度的整合特征图中确定出与第二像素点匹配的第二匹配像素点。如上所述,例如可以采用欧式距离、余弦距离等方式计算像素点之间的特征相似度,对此本公开实施例不作限制。
其中,上述指定窗口例如可以是每个第二像素点的像素位置为中心、周围的(2m+1)2大小的局部窗口,m可以根据实际需求设置,例如可以是设置为3,对此本公开实施例不作限制。通过该方式,能够缩小在原始帧事件特征图中检索目标匹配像素点的范围,减少运算量,进行提高确定目标匹配像素点的效率。
其中,根据至少两个第二匹配像素点各自对应的特征相似度,将至少两个第二匹配像素点中特征相似度最大的第二匹配像素点,确定为与第二像素点匹配的目标匹配像素点,可以理解为,针对某个第二像素点,先从每个第s *尺度的整合特征图中,确定出与该像素点匹配的第二匹配像素点;进而根据每个第二匹配像素点对应的特征相似度,从各个第二匹配像素点中确定出特征相似度最大(也即欧式距离或余弦距离最小)的第二匹配像素点,作为与该第二像素点匹配的目标匹配像素点。
基于上述确定目标匹配特征点的实现方式,以两个第s *尺度的整合特征图为例,公式(9)示出根据本公开实施例一种确定第s *尺度的目标整合特征图的方式。
其中，根据上下文可将公式(9)重构为如下形式(符号记法为便于说明而引入)：
F(i *)=v 0(j 0 *)，当D(i *,j 0 *)≤D(i *,j 2 *)时；否则F(i *)=v 2(j 2 *)    (9)
其中，i *代表第s *尺度的预估帧事件特征图中的任一个第二像素点的像素位置；j 0 *与j 2 *分别代表两个第s *尺度的整合特征图上与该第二像素点匹配的第二匹配像素点的像素位置；v 0(j 0 *)与v 2(j 2 *)分别代表两个第s *尺度的整合特征图上第二匹配像素点处的特征值；F(i *)代表第s *尺度的目标整合特征图上像素位置i *处的特征值；D(i *,j 0 *)与D(i *,j 2 *)分别代表该第二像素点与两个第s *尺度的整合特征图上第二匹配像素点之间的欧式距离。
上述公式(9)可理解为根据两个第s *尺度的整合特征图分别与第s *尺度的预估帧事件特征图之间的欧式距离，在两个第s *尺度的整合特征图上择优选取一个欧式距离最小的特征值，作为第s *尺度的目标整合特征图上的特征值。
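下面给出与上述择优选取过程对应的一个示意性草图(假设两个整合特征图已通过图像块位移与查询特征逐像素对齐，故可在相同位置直接比较欧式距离；该假设以及张量形状均为示例性说明)：

```python
import torch

def select_target_integration(query, integ_a, integ_b):
    # query / integ_a / integ_b: [1, C, H, W], 同一尺度下的预估帧事件特征图与两个整合特征图
    # 逐像素比较查询特征与两个整合特征图的欧式距离, 保留距离更小者的特征值
    dist_a = ((query - integ_a) ** 2).sum(dim=1, keepdim=True)   # [1, 1, H, W]
    dist_b = ((query - integ_b) ** 2).sum(dim=1, keepdim=True)
    keep_a = (dist_a <= dist_b).float()
    return keep_a * integ_a + (1.0 - keep_a) * integ_b           # 目标整合特征图
```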
在本公开实施例中,能够在第s *尺度的整合特征图包括至少两个的情况下,确定出与各个第二像素点匹配的目标匹配像素点,从而得到与第s *尺度的预估帧事件特征图最匹配的第s *尺度的目标整合特征图。
如上所述,第一事件信息可以是根据事件相机采集的事件信号确定的,事件信号可以表征事件相机所拍摄物体上亮度发生变化的采集点、在一定时间区间内的亮度变化程度。在一种可能的实现方式中,在步骤S11中,获取待处理视频对应的初始待插帧,以及初始待插帧对应的第一事件信息,包括:
步骤S111:根据指定的插帧时刻,以及原始视频帧中与插帧时刻相邻的原始视频帧,生成初始待插帧,待处理视频是事件相机采集的;
步骤S112:根据事件相机在插帧时刻对应的时间区间内所采集的事件信号,确定第一事件信息,事件信号用于表征事件相机所拍摄物体上亮度发生变化的采集点、在时间区间内的亮度变化程度。
应理解的是,任意两帧原始视频帧中间可以插入至少一个待插帧,用户可以指定两帧原始视频帧中间至少一个插帧时刻,以便于通过上述本领域已知的光流估计算法,计算任意两帧原始视频帧到各个插帧时刻的光流,并根据光流将原始视频帧通过前向渲染(也即前向映射)等方式,渲染得到初始待插帧。对于初始待插帧的数量以及生成方式,本公开实施例不作限制。
其中,插帧时刻对应的时间区间,可以理解为插帧时刻所在的时间窗口,在一种可能的实现方式中,任意插帧时刻t对应的时间区间可以为(t-τ,t+τ),其中,τ例如可以是与插帧时刻相邻的两帧原始视频帧之间时长的一半,或1/3等,可依据待插入视频帧的帧率确定,对此本公开实施例不作限制。
举例来说，假设插帧时刻为t，t可以是一个归一化的分数时刻，可以将插帧时刻所在时间窗口(t-τ,t+τ)内采集的事件信号进行累加，得到第一事件信息。如上所述，第一事件信息可以采用"图"的形式记录上述时间区间内采集的事件信号的累加值，通过该方式，之后便于提取第一事件信息对应的事件特征图。
在本公开实施例中,可以有效得到初始待插帧以及初始待插帧对应的第一事件信息。
为了便于对第一事件信息进行特征提取，可以将初始待插帧的插帧时刻处采集的事件信号转换为多通道的张量，也即得到第一事件信息，在一种可能的实现方式中，在步骤S112中，根据事件相机在插帧时刻对应的时间区间内所采集的事件信号，确定第一事件信息，包括：
步骤S1121:将时间区间内所采集的事件信号划分为M组事件信号,M为正整数。
如上所述，在事件相机所拍摄场景中的物体运动或光照改变造成亮度变化的情况下，事件相机会产生一系列微秒级的事件信号，这些事件信号可以以事件流的方式输出。基于此，可理解的是，插帧时刻对应的时间区间内所采集的事件信号包括多个。
其中,M的值可以根据实际需求、特征提取网络的网络结构等设置,例如可以设置为20,对此本公开实施例不作限制。
步骤S1122:针对第m组事件信号,按照预设的信号过滤区间,从第m组事件信号中筛除处于信号过滤区间外的事件信号,得到第m组目标事件信号,m∈[1,M]。
在一种可能的实现方式中,信号过滤区间可以是预先设置的用于过滤异常事件信号的信号区间,例如,信号过滤区间可以设置为[-10,10],其中,信号过滤区间可以根据历史经验、事件相机的固有参数等设置,对此本公开实施例不作限制。
其中,异常事件信号可以理解为不正常情况(例如环境光的亮度突然增大等)下采集的事件信号,通常情况下,异常事件信号的值会过大或过小,包含异常事件信号的事件信息可能无法准确表征物体的运动轨迹。
那么对于每组事件信号，从第m组事件信号中筛除处于信号过滤区间外的事件信号，可以理解为，过滤掉第m组事件信号中的异常事件信号，通过该方式，可以使第m组目标事件信号中包含有效正常的事件信号，从而使基于M组目标事件信号生成的第一事件信息能准确表征物体的运动轨迹。
步骤S1123:根据第m组目标事件信号中、各个目标事件信号的极性以及信号位置,将同一信号位置处的目标事件信号进行累加,得到第m个子事件信息,信号位置用于表征与目标事件信号对应的采集点、在事件相机的成像平面中的坐标位置,其中,第一事件信息包括M个子事件信息。
可知晓的是,事件相机采集的事件信号是带有极性的,也即事件信号中有负数有正数。如上所述,事件相机可以同时采集事件信号与视频信号,事件信号表征的是事件相机所拍摄物体上亮度发生变化的采集点、在时间区间内的亮度变化程度,每个亮度发生变化的采集点在事件相机的成像平面中会映射有对应的坐标位置。
其中,根据第m组目标事件信号中、各个目标事件信号的极性以及信号位置,将同一信号位置处的目标事件信号进行累加,得到第m个子事件信息,可以理解为,处于同一组内的目标事件信号会按照各自的极性以及信号位置进行聚合累加,得到第m个子事件信息。
如上所述,第一事件信息可以采用“图”的形式记录上述时间区间内采集的事件信号的累加值,那么第m个子事件信息可以理解为第一事件信息的第m个通道,第一事件信息可以是M个通道的图,或者说M个通道的张量。
在本公开实施例中,能够将插帧时刻对应的时间区间内采集的事件信号,有效转换成多通道的第一事件信息,从而便于之后提取第一事件信息的事件特征图。
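下面给出将事件信号转换为M通道第一事件信息的一个示意性草图(数组接口、变量命名均为示例性假设；这里以对累加结果裁剪到[-clip,clip]为例，与前述逐事件筛除的做法仅在实现细节上略有差异，仅作示意)：

```python
import numpy as np

def events_to_tensor(ts, xs, ys, ps, t_start, t_end, H, W, M=20, clip=10):
    # ts/xs/ys/ps: 事件的时间戳、列坐标、行坐标、极性(+1/-1), 均为一维数组(坐标为整数, 示例性假设)
    # 将 (t_start, t_end) 内的事件等间距划分为 M 组, 同组内按极性在同一像素位置累加
    voxel = np.zeros((M, H, W), dtype=np.float32)
    mask = (ts >= t_start) & (ts < t_end)
    ts, xs, ys, ps = ts[mask], xs[mask], ys[mask], ps[mask]
    bins = np.clip(((ts - t_start) / (t_end - t_start) * M).astype(np.int64), 0, M - 1)
    np.add.at(voxel, (bins, ys, xs), ps)      # 同一信号位置处的目标事件信号累加
    return np.clip(voxel, -clip, clip)        # 得到 M 通道的第一事件信息
```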
在一种可能的实现方式中,上述本公开实施例中的视频插帧方法是通过图像处理网络实现的,图4示出根据本公开实施例的一种图像处理网络的示意图,如图4所示,所述图像处理网络包括互补信息融合网络与亚像素运动注意力网络,互补信息融合网络包括双分支特征提取子网络(即图4中两个Unet)与多尺度自适应融合子网络(即图4中AAFB)。
如图4所示,在一种可能的实现方式中,在步骤S12中,分别对初始待插帧以及第一事件信息进行特征提取,得到初始待插帧对应的初始帧特征图以及第一事件信息对应的事件特征图,包括:通过双分支特征提取子网络,分别对初始待插帧(I 0→1与I 0→2)以及第一事件信息(E 1)进行特征提取,得到初始待插帧对应的初始帧特征图f s以及第一事件信息对应的事件特征图e s。通过该方式,可以有效生成初始帧特征图与事件特征图。
在一种可能的实现方式中,如图4所示,双分支特征提取网络的每个分支可以采用UNet网络,每个UNet网络可以包括5组卷积层,第一组卷积层保留了输入数据的分辨率,而其它卷积层在长和宽维度上,分别将输入特征图下采样为原来的1/2,5组卷积层将特征通道数扩展为32,64,128,256,256个。应理解的是,以上双分支特征提取网络的网络结构是本公开实施例提供的一种实现方式,实际上,本领域技术人员可以根据需求设计双分支特征提取网络的网络结构,对于双分支特征提取网络的网络结构本公开实施例不作限制。
如图4所示,初始帧特征图f s为5个尺度的特征图,f s可以代表第s尺度的初始帧特征图,事件特征图e s为5个尺度的特征图,e s表示第s尺度的事件特征图,也即s∈{0,1,2,3,4}。其中,f 0代表第0尺度的初始帧特征图,e 0代表第0尺度的事件特征图,X 0代表第0尺度的融合特征图,其它f 1~f 4、e 1~e 4、X 1~X 4以此类推,不做赘述。
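作为上述双分支特征提取结构的一个示意性草图(仅为一种可能实现，输入通道数、激活函数与下采样方式均为示例性假设，且省略了Unet的解码器与跳跃连接)，可参考如下代码：

```python
import torch
import torch.nn as nn

class BranchEncoder(nn.Module):
    # 示意性的单分支编码器: 5组卷积, 通道数 32/64/128/256/256, 第一组保留分辨率, 其余逐级下采样1/2
    def __init__(self, in_ch):
        super().__init__()
        chs = [32, 64, 128, 256, 256]
        layers, prev = [], in_ch
        for i, c in enumerate(chs):
            stride = 1 if i == 0 else 2
            layers.append(nn.Sequential(
                nn.Conv2d(prev, c, 3, stride=stride, padding=1),
                nn.LeakyReLU(0.1, inplace=True),
            ))
            prev = c
        self.stages = nn.ModuleList(layers)

    def forward(self, x):
        feats = []
        for stage in self.stages:
            x = stage(x)
            feats.append(x)   # 依次得到 f_0 ... f_4 (或 e_0 ... e_4)
        return feats

# 双分支: 一支处理初始待插帧(例如两帧拼接为6通道), 一支处理20通道的第一事件信息(通道数均为示例性假设)
frame_branch = BranchEncoder(in_ch=6)
event_branch = BranchEncoder(in_ch=20)
```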
在一种可能的实现方式中，在步骤S131中，根据初始帧特征图与事件特征图，生成预估待插帧，包括：通过多尺度自适应融合子网络，实现根据初始帧特征图f s与事件特征图e s，生成预估待插帧。
通过该方式,可快速准确地生成预估待插帧。
在一种可能的实现方式中,在步骤S132中,根据与初始待插帧相邻的原始视频帧以及原始视频帧对应的第二事件信息,对预估待插帧进行优化,得到目标待插帧,包括:通过亚像素运动注意力网络,实现根据与初始待插帧相邻的原始视频帧以及原始视频帧对应的第二事件信息,对预估待插帧进行优化,得到目标待插帧。通过该方式,可准确优化预估待插帧,得到图像质量更高的目标待插帧。
如图4中亚像素运动注意力网络，I 0与I 2代表与初始待插帧的插帧时刻相邻的原始视频帧，E 0与E 2代表与原始视频帧(I 0与I 2)分别对应的第二事件信息，<I 0,E 0>与<I 2,E 2>代表两个原始帧事件组合信息，预估待插帧与第一事件信息(E 1)的组合则代表预估帧事件组合信息。
如图4所示，亚像素运动注意力网络可以包括特征提取子网络，在步骤S1323中，通过特征提取子网络分别对预估帧事件组合信息与原始帧事件组合信息进行特征提取，得到预估帧事件组合信息对应的预估帧事件特征图，以及原始帧事件组合信息对应的原始帧事件特征图。其中，特征提取子网络可以包括参数共享的三层卷积层，预估帧事件特征图与原始帧事件特征图可以分别是3个尺度的特征图，s *∈{2,3,4}。
如图4所示，亚像素运动注意力网络可以包括亚像素注意力子网络与亚像素整合子网络，在一种可能的实现方式中，在步骤S1324中，可以通过亚像素注意力子网络，实现根据预估帧事件特征图，对原始帧事件特征图进行调整，得到整合特征图。
在一种可能的实现方式中，在步骤S13251中，通过亚像素整合子网络，实现根据第s *尺度的预估帧事件特征图分别与至少两个第s *尺度的整合特征图之间的特征相似度，确定第s *尺度的目标整合特征图，也即分别得到第2尺度、第3尺度以及第4尺度的目标整合特征图。应理解的是，第s *尺度的目标整合特征图、第s *尺度的预估帧事件特征图与第s *尺度的融合特征图三者之间的尺寸相同。
如图4所示，亚像素运动注意力网络可以包括多尺度自适应融合子网络AAFB、残差网络以及解码网络(未在图4中示出)，在一种可能的实现方式中，在步骤S132521中，可以通过残差网络提取第S-S *尺度的预估帧事件特征图的残差特征，得到第S-S *尺度的残差特征图(如图4中R 2，代表第2尺度的残差特征图)，进而将第S-S *尺度的残差特征图(如R 2)、第S-S *尺度的目标整合特征图以及第S-S *尺度的融合特征图(如X 2)进行通道拼接以及滤波处理，得到第S-S *尺度的目标融合特征图。
在一种可能的实现方式中,在步骤S132522中,通过多尺度自适应融合子网络AAFB,对第s *-1尺度的目标融合特征图、第s *尺度的目标整合特征图以及第s *尺度的融合特征图进行特征融合,得到第s *尺度的目标融合特征图。
在一种可能的实现方式中，在步骤S132523中，通过残差网络，提取第s *尺度的目标融合特征图中的残差特征，得到第s *尺度的残差特征图。应理解的是，R 3代表第3尺度的残差特征图，R 4代表第4尺度的残差特征图。
在一种可能的实现方式中，在步骤S132524中，通过解码网络对第S尺度的残差特征图(如R 4)进行解码处理，得到解码后的残差信息。其中，将残差信息叠加至预估待插帧中，得到目标待插帧，该过程可以表示为：目标待插帧=预估待插帧+残差信息。
需要说明的是,图4示出的图像处理网络是本公开实施例提供的一种实现方式,实际上,本领域技术人员可以根据实际需求设计用于实现本公开实施例的视频插帧方式的图像处理网络,对此本公开实施例不作限制。
在本公开实施例中,能够通过图像处理网络,准确高效地生成目标待插帧。
应理解的是,在部署使用图像处理网络前,通常需要对图像处理网络进行训练,在一种可能的实现方式中,所述方法还包括:
根据样本视频,训练初始图像处理网络,得到图像处理网络,样本视频包括样本中间帧以及与样本中间帧相邻的样本视频帧。
应理解的是,初始图像处理网络的网络结构与图像处理网络相同,网络参数可能不同,样本中间帧可以是样本视频中两帧样本视频帧之间的中间视频帧,也即样本中间帧也是样本视频中原始的视频帧。
其中,根据样本视频,训练初始图像处理网络,得到图像处理网络,包括:
根据样本中间帧对应的中间时刻以及样本视频帧,生成初始中间帧;
将样本视频帧以及初始中间帧输入至初始图像处理网络中,得到初始图像处理网络输出的预测中间帧;
根据预测中间帧与样本中间帧之间的损失,更新初始图像处理网络的网络参数至损失满足预设条件,得到图像处理网络。
其中,可以参照上述本公开实施例步骤S111的方式,实现根据样本中间帧对应的中间时刻以及样本视频帧,生成初始中间帧,也即通过上述本领域已知的光流估计算法,计算样本视频帧到中间时刻的光流,并根据光流将样本视频帧通过前向渲染(也即前向映射)等方式,渲染得到初始中间帧。
应理解的是,将样本视频帧以及初始中间帧输入至初始图像处理网络中,得到初始图像处理网络输出的预测中间帧,可以参照上述本公开实施例通过图像处理网络生成目标待插帧的实现过程,在此不做赘述。
在一种可能的实现方式中,可以采用本领域已知的损失函数,例如,沙博尼耶损失函数(Charbonnier Loss)等,计算预测中间帧与样本中间帧之间的损失,对此本公开实施例不作限制。
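作为沙博尼耶损失的一个最小示意性实现(eps的取值为示例性假设)：

```python
import torch

def charbonnier_loss(pred, target, eps=1e-6):
    # 沙博尼耶损失: L = mean(sqrt((pred - target)^2 + eps^2)), 相比L2损失对异常值更鲁棒
    return torch.sqrt((pred - target) ** 2 + eps * eps).mean()
```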
在一种可能的实现方式中,预设条件例如可以包括:损失收敛、损失置0、迭代次数达到指定次数等,对此本公开实施例不作限制。
在本公开实施例中,能够使训练后的图像处理网络,准确高效地生成目标待插帧。
如上所述,图像处理网络包括互补信息融合网络与亚像素运动注意力网络,为提高图像处理网络的训练效率,可以先训练互补信息融合网络,在互补信息融合网络的损失收敛后,固定互补信息融合网络的网络参数,再接着训练亚像素运动注意力网络。
在一种可能的实现方式中,初始图像处理网络包括初始互补信息融合网络与初始亚像素运动注意力网络,预测中间帧包括:初始互补信息融合网络输出的第一预测中间帧,以及初始亚像素运动注意力网络输出的第二预测中间帧;
其中,根据预测中间帧与所述样本中间帧之间的损失,更新初始图像处理网络的网络参数至损失满足预设条件,得到图像处理网络,包括:
根据第一预测中间帧与样本中间帧之间的第一损失,更新初始互补信息融合网络的网络参数至第一损失收敛,得到互补信息融合网络;
将互补信息融合网络输出的样本预测中间帧,输入至初始亚像素运动注意力网络,得到第二预测中间帧;
根据第二预测中间帧与样本中间帧之间的第二损失,更新初始亚像素运动注意力网络的网络参数至第二损失收敛,得到亚像素运动注意力网络。
上述对初始图像处理网络的训练过程,可以理解为包含两个阶段的网络训练。其中,第一阶段的网络训练,先训练初始互补信息融合网络,在初始互补信息融合网络的第一损失收敛后,固定初始互补信息融合网络的网络参数,得到互补信息融合网络。
第二阶段的网络训练，利用训练后的互补信息融合网络输出的样本预测中间帧，作为初始亚像素运动注意力网络的输入数据，得到初始亚像素运动注意力网络输出的第二预测中间帧，再利用第二预测中间帧与样本中间帧之间的第二损失，更新初始亚像素运动注意力网络的网络参数至第二损失收敛，得到训练后的亚像素运动注意力网络。
在本公开实施例中,能够分阶段训练图像处理网络,提高图像处理网络的训练效率。
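下面给出上述两阶段训练流程的一个示意性草图(网络的调用接口、数据加载格式与超参数均为示例性假设，损失以简化的沙博尼耶形式代替，仅用于说明训练顺序与参数固定的做法)：

```python
import torch

def train_two_stage(fusion_net, attention_net, loader, epochs_1=10, epochs_2=10, lr=1e-4):
    # 简化的沙博尼耶损失, 作为示例性的重建损失
    charbonnier = lambda a, b: torch.sqrt((a - b) ** 2 + 1e-12).mean()

    # 第一阶段: 仅训练互补信息融合网络
    opt1 = torch.optim.Adam(fusion_net.parameters(), lr=lr)
    for _ in range(epochs_1):
        for frames, init_mid, events, gt_mid in loader:
            pred1 = fusion_net(frames, init_mid, events)   # 第一预测中间帧(调用接口为示例性假设)
            loss1 = charbonnier(pred1, gt_mid)
            opt1.zero_grad(); loss1.backward(); opt1.step()

    # 第一损失收敛后固定互补信息融合网络的网络参数
    for p in fusion_net.parameters():
        p.requires_grad_(False)

    # 第二阶段: 以互补信息融合网络输出的样本预测中间帧为输入, 训练亚像素运动注意力网络
    opt2 = torch.optim.Adam(attention_net.parameters(), lr=lr)
    for _ in range(epochs_2):
        for frames, init_mid, events, gt_mid in loader:
            with torch.no_grad():
                pred1 = fusion_net(frames, init_mid, events)
            pred2 = attention_net(pred1, frames, events)   # 第二预测中间帧(调用接口为示例性假设)
            loss2 = charbonnier(pred2, gt_mid)
            opt2.zero_grad(); loss2.backward(); opt2.step()
```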
需要说明的是,本公开实施例中特征图的“尺度”,可以理解为,神经网络的不同层级下提取的特征图,或者说,用尺度区分不同层级网络所提取的特征图,特征图的“尺寸”可理解为不同尺度的特征图的长宽高,或者说不同尺度的特征图的分辨率。应理解的是,不同尺度的特征图的尺寸可以不同,同一尺度下的特征图的尺寸可以相同。
本公开实施例提供一种视频插帧方法,该视频插帧方法包括:互补信息融合阶段以及亚像素注意力的画质增强阶段。
在互补信息融合阶段中，给定连续的两个稀疏采样的原始视频帧和同一场景下同步采样得到的事件信号。本公开实施例的目的是在两帧原始视频帧之间t∈(0,1)的任意插帧时刻处合成并插入某一中间帧，其中t是一个归一化的分数时刻。对于t时刻的视频帧，在局部范围的时间窗口内可以获得相关的事件信息。
在互补信息融合阶段，首先利用计算得到的光流，将两帧原始视频帧中的像素移动到和插帧时刻处视频帧对齐的位置，此过程将会输出2个粗糙的初始待插帧，该初始待插帧在光流估计不准确的地方可以观察到明显的误差。互补信息融合阶段则可以利用从插帧时刻处的事件信息中挖掘互补的运动轨迹信息来修正这些误差。
其中，本公开实施例使用了两个Unet(可以采用任意相关的多尺度特征提取网络)分别提取事件信息和视频信号的特征，然后通过自适应外观互补融合网络(如图4中的AAFB)，将提取的两个特征进行融合，最终输出优化的预估待插帧。
其中，为了探索运动上下文信息，从而进一步优化预估待插帧的画质，本公开实施例使用了注意力机制，来进行第二阶段对预估待插帧的优化。其中，可以将预估待插帧与对应事件信息的组合信息作为查询信息，将相邻的原始视频帧与对应事件信息的组合信息作为键值，通过亚像素精度的注意力机制来更精确地将查询信息和键值信息匹配，通过这种匹配关系，与每一个查询信息相关的键值信息可以被更精确地检索出来，并使用亚像素精度的图像块位移方法来聚合相关的内容，最终输出一个多尺度的上下文特征(即上述整合特征图)；进而将该上下文特征与互补信息融合阶段产生的多尺度特征利用AAFB进行进一步融合，并通过若干残差网络处理输出进一步优化的目标待插帧。
其中，针对外观互补信息融合阶段，可以利用本领域已知的光流估计算法来分别计算两帧原始视频帧到插帧时刻的光流，并根据光流将两帧原始视频帧通过前向渲染的方法渲染得到两个初始待插帧，作为双分支特征提取网络的一个输入。考虑到，由于事件信号是时间稠密的，为了能将事件信号合理地输入到双分支特征提取网络中，本公开实施例将插帧时刻对应时间窗口内的事件信号等间距聚合成20通道的事件信息，作为双分支特征提取网络的另一个输入。如图4所示，双分支特征提取网络可以是一个双分支的Unet，为了有效地聚合两种信息的特征，本公开实施例提出了一种多尺度自适应聚合网络(如图4中的AAFB)，可以有效地将视频信号的特征和事件信号的特征在多尺度层级进行聚合。
本公开实施例提出的多尺度自适应聚合网络是一个由粗到细的逐尺度渐进聚合过程,在将第s个尺度聚合之后的特征记为X s的情况下,各个尺度的融合特征可以递归地通过公式(1)表示。
为了有效率地根据当前尺度的视频信号的特征图f s和事件信号的特征图e s来调制X s，可以将f s和e s看做是对同一潜在重建信息的不同视角表达。本公开实施例借鉴了相关技术中的重归一化思想，使得不同视角表达的特征可以在同一空间对齐，同时能够保持细粒度的空间细节。对于f s和e s两个随机变量，可以分别用两组独立的卷积层去学习空间可变的尺度和偏置c f,b f或c e,b e，然后将各个随机变量用上述公式(2-1)和(2-2)转换成可融合的特征图y e与y f。
通常来讲，事件信号对于运动物体的边界有良好的感知能力，因为这种运动常常会造成图像的快速的亮度变化，而且基于纯视频信号的光流方法，在这种区域的估计值往往是不可靠的。但是对于纹理简单的区域，事件相机捕捉到的事件信息的可靠程度不如基于视频信号提取的信息。可以将第s-1尺度的融合特征图对应的上采样特征图，通过一个卷积层和sigmoid层来提取一个融合软掩码m，并利用该掩码m自适应地融合这两种互补的信息，该过程可参照上述公式(3)。
公式(2-1)、(2-2)和(3)组成一个递归的融合流程，由于该融合流程都是仿射变换，为了增加每个多尺度自适应融合网络的非线性，可以在每个网络的输出端，插入一个3×3卷积操作和LeakyReLU激活函数，以上提到的所有操作共同组合成了AAFB网络。
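以下给出与公式(2-1)、(2-2)所述仿射变换思路对应的一个示意性草图(归一化方式、卷积结构以及尺度/偏置预测所依据的输入均为示例性假设，仅用于说明"学习空间可变的尺度与偏置并做仿射变换"的过程)：

```python
import torch
import torch.nn as nn

class AffineRecalibration(nn.Module):
    # 示意性草图: 用两个卷积分支为每个空间位置学习尺度c与偏置b,
    # 将输入特征变换为可融合特征 y = c * norm(x) + b
    def __init__(self, channels):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)   # 归一化方式为示例性假设
        self.to_scale = nn.Conv2d(channels, channels, 3, padding=1)
        self.to_bias = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, x):
        c = self.to_scale(x)          # 空间可变的尺度 c
        b = self.to_bias(x)           # 空间可变的偏置 b
        return c * self.norm(x) + b   # 仿射变换得到可融合的特征图
```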
对于亚像素运动注意力阶段，本公开实施例采用了轻量级的注意力机制来捕获上下文信息，以进一步优化待插帧的画质效果。如图4所示，亚像素运动注意力阶段的输入主要是视频信号和事件信息的组合信息，然后将组合信息输入到3层参数共享的卷积网络，从而输出3个尺度的特征{v s|s∈{0,1,2}}，其中，尺度个数可以多于或少于3，对此本公开实施例不作限制。
对于相关的信号组合(I 0,E 0)或(I 2,E 2)，输出的各个尺度的特征叫做"值"，而k 0或k 2叫做"键"；而由预估待插帧与E 1的组合计算产生的特征叫做"查询"。在注意力机制中，这些"键"、"值"和"查询"构成了重要的组成元素，并常用于内存检索。
为了在"值"中检索信息，可以用作为"查询"的特征图中的每一个像素，对两个原始帧特征图进行检索。其中，由于这个检索过程是在输入图像1/8分辨率的原始帧特征图上进行的，特征图上有限的位移投射回原尺寸图就是一个很大的位移，因此可以将这个相关性的检索范围限制在每个查询像素位置周围的(2m+1)^2大小的局部窗口范围内。在"查询"特征图上给定一个像素位置i和一个偏移量p∈[-m,m]^2，将各个特征首先经过范数正则化，并通过上述公式(4)示出的欧几里得距离来定义特征之间的相似度大小。
传统的注意力机制，常常通过软注意力机制来聚合信息，会首先对这个相关性矩阵进行softmax归一化，然后通过加权求和的方式对"值"中的所有位置信息进行聚合。对于图像合成任务来说，这可能会模糊即时特征，并造成最终合成的质量的退化。本公开实施例则采用硬注意力机制，硬注意力机制会记录最匹配(也即相似度最大)的位置，也即与"查询"中的某一个特征向量欧几里得距离最小的"键"的位置。
考虑到，由于偏移量p是在1/8分辨率的特征图上进行计算的，就算是最优的偏置，在高分辨率特征图上仍然会有对齐误差。在一种可能的实现方式中，可以在低分辨率特征图上计算亚像素精度的注意力偏移，在将这种注意力机制等比例放大并应用到高分辨率的特征图上的情况下，这种方法可以在一定程度上缓解精度损失。对于"查询"特征图上的某一个特征像素i，硬注意力机制在原始帧特征图上计算出了最匹配的位置j，即j=i+p *，其中p *=argmin p D(i,p)。更准确地讲，行向量元素{D(i,p)|p∈[-m,m]^2}可以被组织成(2m+1)^2个"查询"向量和"键"向量之间的距离，其中p *是最小距离所在的位置。
为了能够获得亚像素精度,以p *为中心的局部距离场可以被一个参数化的二阶多项式进行连续拟合,而这个多项式的全局极小值是有闭合解的。通过将最小二乘拟合融入到神经网络训练过程中,可以纠正局部距离场的形状,并得到亚像素精度的估计。
本公开实施例以p *为中心采样一个大小为(2n+1)^2的局部窗口，其中n例如可以设置为n=1，并将这个局部距离场记做d。则这个局部距离场可以定义为上述公式(5)；为了使这个局部距离场在定义区间[-n,n]^2上有意义，可以在该区域上定义一个局部二次曲面如上述公式(6)，公式(6)为一个具有全局极小值点的真实二次曲面；为了估计公式(6)中的未知参数A,b和c，可以使用加权最小二乘法，根据(2n+1)^2个已知的自变量u和其函数值d(u)，来最小化公式(7)示出的目标函数。
可理解的是，w(u)可以是常数矩阵，那么该最小化求解目标函数的过程，对于每一个输入变量都是可导的，因此这个求解过程可以作为一个可导的层，很容易嵌入到图像处理网络训练中。考虑到，为了保证估计出来的A是正定的，本公开实施例假设A中非对角线的元素全为0，只优化对角线上的元素，并在对角线上的元素出现负数的情况下，将负数的元素修改为0。应理解的是，尽管忽略非对角线元素会使得估计出来的二次曲面是各向同性的，但是通过将这个求解过程嵌入到图像处理网络训练过程中，公式(5)所示的局部距离场是可以通过反向传播来修正的，并有效弥补其表达的局限性。并可以通过上述公式(6)得到亚像素精度的匹配位置，也即亚像素位置。
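下面给出以局部距离场拟合二次曲面并求取亚像素偏移的一个示意性草图(这里采用等权重的最小二乘，并将负的对角元素近似置为极小正数以保证闭合解可用，均为示例性处理)：

```python
import torch

def subpixel_offset(d, eps=1e-6):
    # d: [(2n+1), (2n+1)] 以最优整数偏移 p* 为中心的局部距离场
    # 用对角形式的二阶曲面 f(u) = a_x*u_x^2 + a_y*u_y^2 + b_x*u_x + b_y*u_y + c 做最小二乘拟合
    n = d.shape[-1] // 2
    coords = torch.arange(-n, n + 1, dtype=d.dtype)
    uy, ux = torch.meshgrid(coords, coords, indexing="ij")
    ux, uy, dv = ux.reshape(-1), uy.reshape(-1), d.reshape(-1)
    # 设计矩阵: [u_x^2, u_y^2, u_x, u_y, 1]
    X = torch.stack([ux * ux, uy * uy, ux, uy, torch.ones_like(ux)], dim=1)
    # 等权重的最小二乘解(w(u)取常数)
    theta = torch.linalg.lstsq(X, dv.unsqueeze(1)).solution.squeeze(1)
    ax, ay, bx, by = theta[0], theta[1], theta[2], theta[3]
    ax, ay = torch.clamp(ax, min=eps), torch.clamp(ay, min=eps)  # 负的对角元素近似置为极小正数
    off_x, off_y = -bx / (2 * ax), -by / (2 * ay)                # 二次曲面全局极小值点的闭合解
    return off_x.clamp(-n, n), off_y.clamp(-n, n)                # 亚像素精度的横/纵向偏移

# 用法示例: 3x3 局部距离场
if __name__ == "__main__":
    d = torch.rand(3, 3)
    print(subpixel_offset(d))
```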
通过上述步骤，对于预估帧事件特征图上的每一个像素i，可以在原始帧事件特征图上找到一个与之相匹配的亚像素位置j *，并根据该亚像素位置j *将作为"值"的原始帧事件特征图进行移动。其中，第s *尺度的原始帧事件特征图在长宽维度上是最小尺度的预估帧事件特征图大小的n倍。其中，可以在第s *尺度的原始帧事件特征图上以j *为中心通过双线性插值的方法裁切一个n×n大小的图像块，然后对多个图像块进行尺寸拼接，得到与第s *尺度的原始帧事件特征图相同尺寸且信息重组之后的整合特征图。
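以下给出"以亚像素位置为中心做双线性裁切，并按查询像素的排列拼接成整合特征图"的一个示意性草图(基于grid_sample实现，坐标约定与张量形状均为示例性假设)：

```python
import torch
import torch.nn.functional as F

def gather_patches(value_feat, subpix_pos, n=2):
    # value_feat: [1, C, Hs, Ws] 第s*尺度的原始帧事件特征图("值")
    # subpix_pos: [H, W, 2] 最小尺度上每个查询像素对应的亚像素匹配位置(已放大到s*尺度坐标, 末维为x/y)
    _, C, Hs, Ws = value_feat.shape
    H, W, _ = subpix_pos.shape
    offs = torch.arange(n, dtype=value_feat.dtype) - (n - 1) / 2.0
    oy, ox = torch.meshgrid(offs, offs, indexing="ij")                 # n x n 块内局部偏移
    sample_x = subpix_pos[..., 0].view(H, W, 1, 1) + ox                # [H, W, n, n]
    sample_y = subpix_pos[..., 1].view(H, W, 1, 1) + oy
    # 重新排列为 [H*n, W*n] 的输出栅格: 行 = 查询行*n + 块内行, 列 = 查询列*n + 块内列
    gx = sample_x.permute(0, 2, 1, 3).reshape(H * n, W * n)
    gy = sample_y.permute(0, 2, 1, 3).reshape(H * n, W * n)
    # 归一化到 [-1, 1] 供 grid_sample 做双线性采样
    grid = torch.stack([gx / (Ws - 1) * 2 - 1, gy / (Hs - 1) * 2 - 1], dim=-1).unsqueeze(0)
    return F.grid_sample(value_feat, grid, mode="bilinear", align_corners=True)  # [1, C, H*n, W*n]
```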
在上述过程中,可以同时在两个原始帧事件特征图上采用这种亚像素拟合和图像块移动的策略,产生了重整之后的两个整合特征图,之后可以参照公式(9),实现根据特征间的距离,在两个整合特征图上择优保留一个距离最小的特征,生成目标整合特征图。
通过上述过程，可以得到多尺度的目标整合特征图，进而可以将互补信息融合阶段输出的融合特征图以及目标整合特征图，通过上述多尺度自适应融合网络进行整合。整合之后的最高分辨率的特征图会最终通过一个解码器，输出预估待插帧的优化残差R 1，目标待插帧可以表示为预估待插帧与优化残差R 1之和。
在一种可能的实现方式中，对于某一个时刻t，可以将局部时间窗口(t-τ,t+τ)等间距地划分为20个组，其中τ表示的是连续两帧之间的间隔时间的一半。落在同一个组内的事件信号会按照各自的极性以及像素位置聚合，并将最大最小值范围裁剪到[-10,10]，最终会构成一个20通道的张量，也即得到第一事件信息。
在一种可能的实现方式中，对于采用的双分支特征提取网络，可以采用双分支的Unet网络，每个分支的Unet网络有4个尺度，每个尺度的编码器分别通过一组卷积网络将特征通道数扩展为32,64,128,256,256个，其中，第一组卷积网络保留了输入的分辨率，而其他的卷积网络在长和宽维度上，分别将特征图下采样为原来的1/2，解码器采用了对称的结构设计并和相应的编码器特征进行跳跃连接。在多尺度特征融合之后，最高分辨率的特征层再通过两个32通道的卷积层来产生最终的输出结果。
根据本公开实施例的视频插帧方法,首先是互补信息融合阶段,根据初始待插帧的插帧时刻,利用与该插帧时刻相关的事件信号,和该插帧时刻最近邻的左右两个原始视频帧进行特征提取和互补融合,从而合成一个初步的预估待插帧。之后是基于亚像素运动注意力的画质增强阶段,将合成的预估待插帧,通过再次使用与其相关的事件信号,以及最近邻左右两个原始视频帧及其相关事件信号,进行第二阶段的优化,从而得到一个人工痕迹更少,画质更优的目标待插帧。通过在相邻两个原始视频帧之间设定不同的插帧时刻,反复运行上述视频插帧方法,可以实现在两个原始视频帧之间进行若干数量的视频插帧过程。通过本公开实施例的视频插帧方法,能够利用事件相机采集的事件信号和低帧率的视频信号来合成目标待插帧,以进行视频插帧,得到高帧率的视频信号。
其中,在上述互补信息融合阶段,本公开实施例首先将插帧时刻左右两个原始帧通过光流估计算法进行像素移动,得到初始待插帧,并作为视频信号特征提取网络的输入,再提取与初始待插帧相关的事件信号作为事件信号特征提取网络的输入。并采用了两个参数相互独立的多尺度特征提取网络分别对视频信号和事件信号进行特征提取,得到两个多尺度的特征图,再利用一个多尺度自适应信息融合网络对两个多尺度的特征图进行融合,将最终的合成特征图通过一个解码器,输出一个初步合成的3通道彩色预估待插帧。
其中,在上述亚像素注意力的画质增强阶段,本公开实施例将互补信息融合阶段合成的预估待插帧、插帧时刻的左右两个原始视频帧分别与各自相关的事件信号叠加,作为共同特征提取网络的输入,用同一个特征提取网络分别对这三组信号进行特征提取,输出多尺度特征。
其中，在上述亚像素注意力的画质增强阶段，本公开实施例在最低尺度的特征图上使用注意力机制，将预估待插帧对应的特征图作为查询，其他两个原始视频帧对应的特征图作为键值，通过硬注意力机制提取出与预估待插帧每一个空间位置的特征最相关的特征位置，再利用该特征周围的局部距离场，拟合一个二次曲面，通过二次曲面的极小值，求出亚像素精度的最相似位置，最终通过双线性插值的方法，将两个键对应的信息进行重新整合，并将这种整合策略进行等比例放大，对其他尺度特征进行相似的组合，并通过保留最大相似度的方式，将两个整合的信息最终融合成一个多尺度信息。
本公开实施例,将整合得到的多尺度信息、预估待插帧对应的低尺度信息以及互补信息融合阶段提取的信息,再次通过多尺度自适应融合的方式进行特征融合与解码,最终得到残差信息。通过将预估待插帧与残差信息叠加,得到画质更优的目标待插帧。
相关技术中,大部分高质量的插帧算法都依赖于在高帧率的样本视频上进行训练,部分方法还需要依赖仿真方法合成事件信号,训练数据获取难度大,且仿真数据训练的模型泛化性差。根据本公开的实施例,能够直接基于低帧率的样本视频上进行网络训练,不依赖高帧率的样本视频和仿真方法。
相关技术中,需要利用光流估计算法设定运动轨迹模型,在实际运动轨迹不满足预设轨迹的情况下会带来性能下降。根据本公开的实施例,通过事件信息表征的运动轨迹信息,直接矫正初始待插帧的画质,并提供了一种更精确的注意力机制,通过更精确地检索并利用运动相关的上下文信息来提升预估待插帧的画质,具有更好的泛化性。
本公开实施例提出了一种将视频信号和事件信号互补融合的方法,通过利用运动敏感且时间上稠密的事件信号,弥补了在估计待插帧的物体运动时缺省的运动轨迹信息,并利用非运动区域记录完整的视频信号,弥补了事件信号对于非运动区域的信息。
本公开实施例提出了一种亚像素精度运动注意力机制,可以在低分辨率特征图上提取对物体运动敏感的亚像素精度注意力,从而可以在分辨率特征图上直接获取高分辨率的注意力信息,从而构造了更精确的注意力机制,通过更精确地检索并利用运动相关的上下文信息来提升画质。
根据本公开的实施例,利用无监督的图像处理网络的训练方式,更符合事件相机实际使用场景,降低了对训练数据的要求,并提高了网络训练的泛化性。
根据本公开实施例中的视频插帧方法,可以利用事件相机拍摄得到的低帧率的视频信号,以及对应场景的事件信号,合成该场景的高帧率的视频信号;还可以完成慢动作回放、提高视频码率(流畅性),稳定图像(电子稳像,视频防抖)等图像处理任务。
根据本公开实施例中的视频插帧方法,可以应用于任何利用事件相机构造的、需要视频插帧功能的产品中,例如视频播放软件,或视频安防软件的慢动作回放等。
可以理解,本公开提及的上述各个方法实施例,在不违背原理逻辑的情况下,均可以彼此相互结合形成结合后的实施例,限于篇幅,本公开不再赘述。本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的具体执行顺序应当以其功能和可能的内在逻辑确定。
此外,本公开还提供了视频插帧装置、电子设备、计算机可读存储介质、程序,上述均可用来实现本公开提供的任一种视频插帧方法,相应技术方案和描述和参见方法部分的相应记载,不再赘述。
图5示出根据本公开实施例的视频插帧装置的框图,如图5所示,所述装置包括:
获取模块101,配置为获取待处理视频对应的初始待插帧,以及所述初始待插帧对应的第一事件信息,所述第一事件信息配置为表征所述初始待插帧中物体的运动轨迹;
特征提取模块102,配置为分别对所述初始待插帧以及所述第一事件信息进行特征提取,得到所述初始待插帧对应的初始帧特征图以及所述第一事件信息对应的事件特征图;
生成模块103,配置为根据所述初始帧特征图与所述事件特征图,生成目标待插帧;
插帧模块104,配置为将所述目标待插帧插入至所述待处理视频中,得到处理后视频。
在一种可能的实现方式中,所述生成模块,包括:预估帧生成子模块,配置为根据所述初始帧特征图与所述事件特征图,生成预估待插帧;预估帧优化子模块,配置为根据所述待处理视频中、与所述初始待插帧的插帧时刻相邻的原始视频帧,以及所述原始视频帧对应的第二事件信息,对所述预估待插帧进行优化,得到所述目标待插帧,所述第二事件信息配置为表征所述原始视频帧中物体的运动轨迹。
在一种可能的实现方式中,所述初始帧特征图包括S个尺度,所述事件特征图包括S个尺度,S为正整数,其中,所述预估帧生成子模块,包括:第一融合单元,配置为根据第0尺度的初始帧特 征图与第0尺度的事件特征图,得到第0尺度的融合特征图;对齐单元,配置为根据第(s-1)尺度的融合特征图,将第s尺度的初始帧特征图与第s尺度的事件特征图进行空间对齐,得到第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图;第二融合单元,配置为根据所述第(s-1)尺度的融合特征图、所述第s尺度的可融合初始帧特征图以及所述第s尺度的可融合事件特征图,得到第s尺度的融合特征图;解码单元,配置为对第(S-1)尺度的融合特征图进行解码处理,得到所述预估待插帧;其中,s∈[1,S)。
在一种可能的实现方式中,所述对齐单元,包括:上采样子单元,配置为对所述第(s-1)尺度的融合特征图进行上采样,得到上采样特征图,所述上采样特征图与所述第s尺度的初始帧特征图以及所述第s尺度的事件特征图的尺寸相同;第一转换子单元,配置为根据所述上采样特征图与所述第s尺度的初始帧特征图之间的第一空间转换关系,得到所述第s尺度的可融合初始帧特征图;第二转换子单元,配置为根据所述上采样特征图与所述第s尺度的事件特征图之间的第二空间转换关系,得到所述第s尺度的可融合事件特征图;其中,所述第s尺度的可融合初始帧特征图、所述第s尺度的可融合事件特征图与所述上采样特征图处于同一特征空间中。
在一种可能的实现方式中,所述第一空间转换关系是根据所述第s尺度的初始帧特征图在空间转换时的第一像素尺寸缩放信息与第一偏置信息,以及所述上采样特征图的特征信息确定的;所述第二空间转换关系是根据所述第s尺度的事件特征图在空间转换时的第二像素尺寸缩放信息与第二偏置信息,以及所述上采样特征图的特征信息确定的;其中,像素尺寸缩放信息表示空间转换中每个像素点的尺寸缩放比例,偏置信息表示空间转换中每个像素点的位置偏移量。
在一种可能的实现方式中,所述第二融合单元,包括:处理子单元,配置为对上采样特征图进行卷积处理以及非线性处理,得到所述上采样特征图对应的掩码图,其中,所述上采样特征图是对所述第(s-1)尺度的融合特征图进行上采样得到的;融合子单元,配置为根据所述掩码图,将所述第s尺度的可融合初始帧特征图与所述第s尺度的可融合事件特征图进行特征融合,得到所述第s尺度的融合特征图。
在一种可能的实现方式中,所述融合子单元,包括:第一融合电路,配置为根据所述掩码图,将所述第s尺度的可融合初始帧特征图与所述第s尺度的可融合事件特征图进行特征融合,得到第s尺度的初始融合特征图;处理电路,配置为对所述第s尺度的初始融合特征图进行卷积处理以及非线性处理,得到所述第s尺度的融合特征图。
在一种可能的实现方式中,所述第一融合电路,配置为:计算所述掩码图与所述第s尺度的可融合事件特征图之间的哈达玛积;根据所述掩码图对应的反向掩码图,计算所述反向掩码图与所述第s尺度的可融合初始帧特征图之间的乘积;将所述哈达玛积与所述乘积相加,得到所述第s尺度的初始融合特征图。
在一种可能的实现方式中,所述第一融合单元,包括:拼接子单元,配置为将所述第0尺度的初始帧特征图与所述第0尺度的事件特征图进行通道拼接,得到拼接特征图;滤波子单元,配置为对所述拼接特征图进行滤波处理,得到所述第0尺度的融合特征图。
在一种可能的实现方式中,所述预估帧优化子模块,包括:第一组合单元,配置为将所述预估待插帧与所述第一事件信息进行组合,得到预估帧事件组合信息;第二组合单元,配置为将所述原始视频帧与所述第二事件信息进行组合,得到原始帧事件组合信息;提取单元,配置为分别对所述预估帧事件组合信息与所述原始帧事件组合信息进行特征提取,得到所述预估帧事件组合信息对应的预估帧事件特征图以及所述原始帧事件组合信息对应的原始帧事件特征图;调整单元,配置为根据所述预估帧事件特征图,对所述原始帧事件特征图进行调整,得到整合特征图;优化单元,配置为根据所述整合特征图、所述预估帧事件特征图以及融合特征图,对所述预估待插帧进行优化,得到所述目标待插帧,所述融合特征图是对所述初始帧特征图与所述事件特征图进行多尺度融合得到的。
在一种可能的实现方式中,所述预估帧事件特征图包括S *个尺度,所述原始帧事件特征图包括S *个尺度,1≤S *≤S,S *为正整数,s *∈[(S-S *),S),第(S-S *)尺度的预估帧事件特征图的尺寸为I×I,I为正整数,其中,所述调整单元,包括:第一确定子单元,配置为针对第(S-S *)尺度的预估帧事件特征图中的任一个第一像素点,从第(S-S *)尺度的原始帧事件特征图中确定出与所述第一像素点匹配的第一匹配像素点;第二确定子单元,配置为根据所述第一匹配像素点的像素位置以及指定偏移量,确定与所述像素位置对应的亚像素位置,所述指定偏移量为小数;调整子单元,配置为根据I×I个所述亚像素位置,对第s *尺度的原始帧事件特征图进行调整,得到第s *尺度的整合特征图。
在一种可能的实现方式中,所述第一确定子单元,包括:计算电路,配置为针对任一个第一像素点,计算所述第一像素点分别与所述第(S-S *)尺度的原始帧事件特征图中、在指定窗口内的各个像素点之间的特征相似度,所述指定窗口是根据所述第一像素点的像素位置确定的;第一确定电路,配置为将所述指定窗口内的各个像素点中的、最大特征相似度所对应的像素点,确定为所述第一匹配像素点。
在一种可能的实现方式中,所述第二确定子单元,包括:第二确定电路,配置为根据所述像素位置、预设的偏移参数以及预设的曲面参数,确定目标函数,求解电路,配置为根据所述偏移参数对应的预设取值区间,对所述目标函数进行最小化求解,得到所述曲面参数的参数值,其中所述偏移参数为所述目标函数中的自变量;第三确定电路,配置为根据所述曲面参数的参数值,确定所述指定偏移量;相加电路,配置为将所述像素位置与所述指定偏移量相加,得到所述亚像素位置。
在一种可能的实现方式中,所述目标函数是根据曲面函数与距离函数之间的差异构建的,所述距离函数是根据所述像素位置与所述偏移参数构建的,所述曲面函数是根据所述曲面参数与所述偏移参数构建的。
在一种可能的实现方式中,所述曲面参数包括第一参数与第二参数,所述第一参数为2×2的矩阵,所述第二参数为2×1的向量,所述第一参数的参数值包括所述矩阵中对角线上的两个第一元素值,所述第二参数的参数值包括所述向量中的两个第二元素值,其中,所述第三确定电路,配置为根据所述两个第一元素值与所述两个第二元素值,确定纵轴偏移量与横轴偏移量,所述指定偏移量包括所述纵轴偏移量与横轴偏移量。
在一种可能的实现方式中,所述第s *尺度的原始帧事件特征图的尺寸是所述第(S-S *)尺度的预估帧事件特征图的n倍,其中,所述调整子单元,包括:裁剪电路,配置为以每一个所述亚像素位置为中心,从所述第s *尺度的原始帧事件特征图上裁切出I×I个、n×n尺寸的特征图块;拼接电路,配置为根据I×I个所述亚像素位置,对所述I×I个、n×n尺寸的特征图块进行尺寸拼接,得到所述第s *尺度的整合特征图,所述第s *尺度的整合特征图与所述第s *尺度的原始帧事件特征图的尺寸相同。
在一种可能的实现方式中,所述原始视频帧包括至少两帧,第s *尺度的整合特征图包括至少两个,其中,所述优化单元,包括:第三确定子单元,配置为根据第s *尺度的预估帧事件特征图以及至少两个第s *尺度的整合特征图,确定第s *尺度的目标整合特征图;优化子单元,配置为根据S *个尺度的目标整合特征图、所述预估帧事件特征图以及所述融合特征图,对所述预估待插帧进行优化,得到所述目标待插帧。
在一种可能的实现方式中,所述第三确定子单元,包括:第四确定电路,配置为针对所述第s *尺度的预估帧事件特征图中的任一个第二像素点,从所述至少两个第s *尺度的整合特征图中,确定出与所述第二像素点匹配的目标匹配像素点;生成电路,配置为根据各个与所述第二像素点匹配的目标匹配像素点处的特征信息,生成所述第s *尺度的目标整合特征图。
在一种可能的实现方式中,所述第四确定电路,配置为:针对任一个第s *尺度的整合特征图,根据所述第二像素点与所述第s *尺度的整合特征图中各个像素点之间的特征相似度,从所述第s *尺度的整合特征图中确定出与所述第二像素点匹配的第二匹配像素点;根据至少两个所述第二匹配像素点各自对应的特征相似度,将至少两个所述第二匹配像素点中特征相似度最大的第二匹配像素点,确定为与所述第二像素点匹配的目标匹配像素点。
在一种可能的实现方式中,所述优化子单元,包括:第二融合电路,配置为根据第(S-S *)尺度的目标整合特征图、第(S-S *)尺度的预估帧事件特征图以及第(S-S *)尺度的融合特征图,得到第(S-S *)尺度的目标融合特征图;第三融合电路,配置为对第(s *-1)尺度的目标融合特征图、第s *尺度的目标整合特征图以及第s *尺度的融合特征图进行特征融合,得到第s *尺度的目标融合特征图;提取电路,配置为提取第s *尺度的目标融合特征图中的残差特征,得到第s *尺度的残差特征图;解码电路,配置为对第S尺度的残差特征图进行解码处理,得到解码后的残差信息;叠加电路,配置为将所述残差信息叠加至所述预估待插帧中,得到所述目标待插帧。
在一种可能的实现方式中,所述第二融合电路,配置为:提取所述第(S-S *)尺度的预估帧事件特征图的残差特征,得到第(S-S *)尺度的残差特征图;将所述第(S-S *)尺度的残差特征图、所述第(S-S *)尺度的目标整合特征图以及所述第S-S *尺度的融合特征图进行通道拼接,得到目标拼接特征图;对所述目标拼接特征图进行滤波处理,得到所述第(S-S *)尺度的目标融合特征图。
在一种可能的实现方式中，所述获取模块，包括：初始生成子模块，配置为根据指定的插帧时刻，以及所述待处理视频中与所述插帧时刻相邻的原始视频帧，生成所述初始待插帧，所述待处理视频是事件相机采集的；事件信息生成子模块，配置为根据所述事件相机在所述插帧时刻对应的时间区间内所采集的事件信号，确定所述第一事件信息，所述事件信号用于表征所述事件相机所拍摄物体上亮度发生变化的采集点、在所述时间区间内的亮度变化程度。
在一种可能的实现方式中,所述事件信息生成子模块,包括:划分单元,配置为将所述时间区间内所采集的事件信号划分为M组事件信号,M为正整数;筛除单元,配置为针对第m组事件信号,按照预设的信号过滤区间,从所述第m组事件信号中筛除处于所述信号过滤区间外的事件信号,得到第m组目标事件信号,m∈[1,M];累加单元,配置为根据所述第m组目标事件信号中、各个目标事件信号的极性以及信号位置,将同一信号位置处的目标事件信号进行累加,得到第m个子事件信息,所述信号位置用于表征与所述目标事件信号对应的采集点、在所述事件相机的成像平面中的坐标位置;其中,所述第一事件信息包括M个子事件信息。
在一种可能的实现方式中,所述视频插帧装置是通过图像处理网络实现的,所述图像处理网络包括互补信息融合网络与亚像素运动注意力网络,所述互补信息融合网络包括双分支特征提取子网络与多尺度自适应融合子网络;其中,所述特征提取模块,配置为:通过所述双分支特征提取子网络,分别对所述初始待插帧以及所述第一事件信息进行特征提取,得到所述初始待插帧对应的初始帧特征图以及所述第一事件信息对应的事件特征图。
在一种可能的实现方式中,所述预估帧生成子模块,配置为通过所述多尺度自适应融合子网络,根据所述初始帧特征图与所述事件特征图,生成预估待插帧;和/或,所述预估帧优化子模块,配置为通过所述亚像素运动注意力网络,根据与所述初始待插帧相邻的原始视频帧以及所述原始视频帧对应的第二事件信息,对所述预估待插帧进行优化,得到所述目标待插帧。
在一种可能的实现方式中,所述装置还包括:网络训练模块,配置为根据样本视频,训练初始图像处理网络,得到所述图像处理网络,所述样本视频包括样本中间帧以及与所述样本中间帧相邻的样本视频帧;其中,所述网络训练模块,包括:中间帧生成子模块,配置为根据样本中间帧对应的中间时刻以及所述样本视频帧,生成初始中间帧;输入子模块,配置为将所述样本视频帧以及所述初始中间帧输入至所述初始图像处理网络中,得到所述初始图像处理网络输出的预测中间帧;更新子模块,配置为根据所述预测中间帧与所述样本中间帧之间的损失,更新所述初始图像处理网络的网络参数至所述损失满足预设条件,得到所述图像处理网络。
在一种可能的实现方式中,所述初始图像处理网络包括初始互补信息融合网络与初始亚像素运动注意力网络,所述预测中间帧包括:所述初始互补信息融合网络输出的第一预测中间帧,以及所述初始亚像素运动注意力网络输出的第二预测中间帧;其中,所述更新子模块,包括:第一更新单元,配置为根据所述第一预测中间帧与所述样本中间帧之间的第一损失,更新所述初始互补信息融合网络的网络参数至所述第一损失收敛,得到所述互补信息融合网络;输入单元,配置为将所述互补信息融合网络输出的样本预测中间帧,输入至所述初始亚像素运动注意力网络,得到所述第二预测中间帧;第二更新单元,配置为根据所述第二预测中间帧与所述样本中间帧之间的第二损失,更新所述初始亚像素运动注意力网络的网络参数至所述第二损失收敛,得到所述亚像素运动注意力网络。
在本公开实施例中,能够实现利用表征初始待插帧中物体的运动轨迹的第一事件信息,对待处理视频的初始待插帧进行优化,使生成的目标待插帧的图像质量高于初始待插帧,从而提高处理后视频的画面质量,有利于降低处理后视频中画面的抖动与扭曲等。
在一些实施例中,本公开实施例提供的装置具有的功能或包含的模块可以用于执行上文方法实施例描述的方法,其具体实现可以参照上文方法实施例的描述,为了简洁,这里不再赘述。
本公开实施例还提出一种计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令被处理器执行时实现上述方法。计算机可读存储介质可以是易失性或非易失性计算机可读存储介质。
本公开实施例还提出一种电子设备,包括:处理器;配置为存储处理器可执行指令的存储器;其中,所述处理器被配置为调用所述存储器存储的指令,以执行上述方法。
本公开实施例还可以提供一种计算机程序,包括计算机可读代码,在所述计算机可读代码在电子设备中运行的情况下,所述电子设备中的处理器执行上述任一实施例的方法。
本公开实施例还提供了一种计算机程序产品,包括计算机可读代码,或者承载有计算机可读代码的非易失性计算机可读存储介质,在所述计算机可读代码在电子设备的处理器中运行的情况下,所述电子设备中的处理器执行上述方法。
电子设备可以被提供为终端、服务器或其它形态的设备。
图6示出根据本公开实施例的一种电子设备800的框图。例如,电子设备800可以是移动电话,计算机,数字广播终端,消息收发设备,游戏控制台,平板设备,医疗设备,健身设备,个人数字助理等终端。
参照图6,电子设备800可以包括以下一个或多个组件:处理组件802,存储器804,电源组件806,多媒体组件808,音频组件810,输入/输出(Input/Output,I/O)的接口812,传感器组件814,以及通信组件816。
处理组件802通常控制电子设备800的整体操作,诸如与显示,电话呼叫,数据通信,相机操作和记录操作相关联的操作。处理组件802可以包括一个或多个处理器820来执行指令,以完成上述的方法的全部或部分步骤。此外,处理组件802可以包括一个或多个模块,便于处理组件802和其他组件之间的交互。例如,处理组件802可以包括多媒体模块,以方便多媒体组件808和处理组件802之间的交互。
存储器804被配置为存储各种类型的数据以支持在电子设备800的操作。这些数据的示例包括用于在电子设备800上操作的任何应用程序或方法的指令,联系人数据,电话簿数据,消息,图片,视频等。存储器804可以由任何类型的易失性或非易失性存储设备或者它们的组合实现,如静态随机存取存储器(Static Random-Access Memory,SRAM),电可擦除可编程只读存储器(Electrically Erasable Programmable Read-Only Memory,EEPROM),可擦除可编程只读存储器(Erasable Programmable Read-Only Memory,EPROM),可编程只读存储器(Programmable Read-Only Memory,PROM),只读存储器(Read Only Memory,ROM),磁存储器,快闪存储器,磁盘或光盘。
电源组件806为电子设备800的各种组件提供电力。电源组件806可以包括电源管理系统,一个或多个电源,及其他与为电子设备800生成、管理和分配电力相关联的组件。
多媒体组件808包括在所述电子设备800和用户之间的提供一个输出接口的屏幕。在一些实施例中,屏幕可以包括液晶显示器(Liquid Crystal Display,LCD)和触摸面板(TouchPanel,TP)。在屏幕包括触摸面板的情况下,屏幕可以被实现为触摸屏,以接收来自用户的输入信号。触摸面板包括一个或多个触摸传感器以感测触摸、滑动和触摸面板上的手势。所述触摸传感器可以不仅感测触摸或滑动动作的边界,而且还检测与所述触摸或滑动操作相关的持续时间和压力。在一些实施例中,多媒体组件808包括一个前置摄像头和/或后置摄像头。在电子设备800处于操作模式,如拍摄模式或视频模式的情况下,前置摄像头和/或后置摄像头可以接收外部的多媒体数据。每个前置摄像头和后置摄像头可以是一个固定的光学透镜系统或具有焦距和光学变焦能力。
音频组件810被配置为输出和/或输入音频信号。例如，音频组件810包括一个麦克风(Microphone,MIC)，在电子设备800处于操作模式，如呼叫模式、记录模式和语音识别模式的情况下，麦克风被配置为接收外部音频信号。所接收的音频信号可以被存储在存储器804或经由通信组件816发送。在一些实施例中，音频组件810还包括一个扬声器，配置为输出音频信号。
I/O接口812为处理组件802和外围接口模块之间提供接口,上述外围接口模块可以是键盘,点击轮,按钮等。这些按钮可包括但不限于:主页按钮、音量按钮、启动按钮和锁定按钮。
传感器组件814包括一个或多个传感器,配置为为电子设备800提供各个方面的状态评估。例如,传感器组件814可以检测到电子设备800的打开/关闭状态,组件的相对定位,例如所述组件为电子设备800的显示器和小键盘,传感器组件814还可以检测电子设备800或电子设备800一个组件的位置改变,用户与电子设备800接触的存在或不存在,电子设备800方位或加速/减速和电子设备800的温度变化。传感器组件814可以包括接近传感器,被配置用来在没有任何的物理接触时检测附近物体的存在。传感器组件814还可以包括光传感器,如互补金属氧化物半导体(Complementary Metal Oxide Semiconductor,CMOS)或电荷耦合装置(Charge-coupled Device,CCD)图像传感器,配置为在成像应用中使用。在一些实施例中,该传感器组件814还可以包括加速度传感器,陀螺仪传感器,磁传感器,压力传感器或温度传感器。
通信组件816被配置为便于电子设备800和其他设备之间有线或无线方式的通信。电子设备800可以接入基于通信标准的无线网络,如无线网络(WiFi),第二代移动通信技术(2G)或第三代移动通信技术(3G),或它们的组合。在一个示例性实施例中,通信组件816经由广播信道接收来自外部广播管理系统的广播信号或广播相关信息。在一个示例性实施例中,所述通信组件816还包括近场通信(Near Field Communication,NFC)模块,以促进短程通信。例如,在NFC模块可基于射频识别(Radio Frequency Identification,RFID)技术,红外数据协会(infrared data association,IrDA)技术,超宽带(Ultra Wide Band,UWB)技术,蓝牙(BlueTooth,BT)技术和其他技术来实现。
在示例性实施例中,电子设备800可以被一个或多个应用专用集成电路(Application Specific  Integrated Circuit,ASIC)、数字信号处理器(Digital Signal Processing,DSP)、数字信号处理设备(Digital Signal Processing Device,DSPD)、可编程逻辑器件(Programmable Logic Device,PLD)、现场可编程门阵列(Field Programmable Gate Array,FPGA)、控制器、微控制器、微处理器或其他电子元件实现,用于执行上述方法。
在示例性实施例中,还提供了一种非易失性计算机可读存储介质,例如包括计算机程序指令的存储器804,上述计算机程序指令可由电子设备800的处理器820执行以完成上述方法。
图7示出根据本公开实施例的一种电子设备1900的框图。例如,电子设备1900可以被提供为一服务器。参照图7,电子设备1900包括处理组件1922,其包括一个或多个处理器,以及由存储器1932所代表的存储器资源,配置为存储可由处理组件1922的执行的指令,例如应用程序。存储器1932中存储的应用程序可以包括一个或一个以上的每一个对应于一组指令的模块。此外,处理组件1922被配置为执行指令,以执行上述方法。
电子设备1900还可以包括一个电源组件1926被配置为执行电子设备1900的电源管理,一个有线或无线网络接口1950被配置为将电子设备1900连接到网络,和一个输入输出(I/O)接口1958。电子设备1900可以操作基于存储在存储器1932的操作系统,例如微软服务器操作系统(Windows ServerTM),苹果公司推出的基于图形用户界面操作系统(Mac OS XTM),多用户多进程的计算机操作系统(UnixTM),自由和开放原代码的类Unix操作系统(LinuxTM),开放原代码的类Unix操作系统(FreeBSDTM)或类似。
在示例性实施例中,还提供了一种非易失性计算机可读存储介质,例如包括计算机程序指令的存储器1932,上述计算机程序指令可由电子设备1900的处理组件1922执行以完成上述方法。
本公开可以是系统、方法和/或计算机程序产品。计算机程序产品可以包括计算机可读存储介质,其上载有用于使处理器实现本公开的各个方面的计算机可读程序指令。
计算机可读存储介质可以是可以保持和存储由指令执行设备使用的指令的有形设备。计算机可读存储介质例如可以是(但不限于)电存储设备、磁存储设备、光存储设备、电磁存储设备、半导体存储设备或者上述的任意合适的组合。计算机可读存储介质的更具体的例子(非穷举的列表)包括:便携式计算机盘、硬盘、随机存取存储器(Random Access Memory,RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、静态随机存取存储器(SRAM)、便携式压缩盘只读存储器(Compact Disc Read-Only Memory CD-ROM)、数字多功能盘(Digital Video Disc,DVD)、记忆棒、软盘、机械编码设备、例如其上存储有指令的打孔卡或凹槽内凸起结构、以及上述的任意合适的组合。这里所使用的计算机可读存储介质不被解释为瞬时信号本身,诸如无线电波或者其他自由传播的电磁波、通过波导或其他传输媒介传播的电磁波(例如,通过光纤电缆的光脉冲)、或者通过电线传输的电信号。
这里所描述的计算机可读程序指令可以从计算机可读存储介质下载到各个计算/处理设备,或者通过网络、例如因特网、局域网、广域网和/或无线网下载到外部计算机或外部存储设备。网络可以包括铜传输电缆、光纤传输、无线传输、路由器、防火墙、交换机、网关计算机和/或边缘服务器。每个计算/处理设备中的网络适配卡或者网络接口从网络接收计算机可读程序指令,并转发该计算机可读程序指令,以供存储在各个计算/处理设备中的计算机可读存储介质中。
用于执行本公开操作的计算机程序指令可以是汇编指令、指令集架构(Instruction Set Architecture,ISA)指令、机器指令、机器相关指令、微代码、固件指令、状态设置数据、或者以一种或多种编程语言的任意组合编写的源代码或目标代码,所述编程语言包括面向对象的编程语言—诸如Smalltalk、C++等,以及常规的过程式编程语言—诸如“C”语言或类似的编程语言。计算机可读程序指令可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络—包括局域网(Local Area Network,LAN)或广域网(Wide Area Network,WAN)—连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。在一些实施例中,通过利用计算机可读程序指令的状态信息来个性化定制电子电路,例如可编程逻辑电路、现场可编程门阵列(FPGA)或可编程逻辑阵列(Programmable Logic Array,PLA),该电子电路可以执行计算机可读程序指令,从而实现本公开的各个方面。
这里参照根据本公开实施例的方法、装置(系统)和计算机程序产品的流程图和/或框图描述了本公开的各个方面。应当理解,流程图和/或框图的每个方框以及流程图和/或框图中各方框的组合,都可以由计算机可读程序指令实现。
这些计算机可读程序指令可以提供给通用计算机、专用计算机或其它可编程数据处理装置的处理器,从而生产出一种机器,使得这些指令在通过计算机或其它可编程数据处理装置的处理器执行时,产生了实现流程图和/或框图中的一个或多个方框中规定的功能/动作的装置。也可以把这些计算机可读程序指令存储在计算机可读存储介质中,这些指令使得计算机、可编程数据处理装置和/或其他设备以特定方式工作,从而,存储有指令的计算机可读介质则包括一个制造品,其包括实现流程图和/或框图中的一个或多个方框中规定的功能/动作的各个方面的指令。
也可以把计算机可读程序指令加载到计算机、其它可编程数据处理装置、或其它设备上,使得在计算机、其它可编程数据处理装置或其它设备上执行一系列操作步骤,以产生计算机实现的过程,从而使得在计算机、其它可编程数据处理装置、或其它设备上执行的指令实现流程图和/或框图中的一个或多个方框中规定的功能/动作。
附图中的流程图和框图显示了根据本公开的多个实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段或指令的一部分,所述模块、程序段或指令的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个连续的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或动作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
该计算机程序产品可以具体通过硬件、软件或其结合的方式实现。在一个可选实施例中,所述计算机程序产品具体体现为计算机存储介质,在另一个可选实施例中,计算机程序产品具体体现为软件产品,例如软件开发包(Software Development Kit,SDK)等等。
以上已经描述了本公开的各实施例,上述说明是示例性的,并非穷尽性的,并且也不限于所披露的各实施例。在不偏离所说明的各实施例的范围和精神的情况下,对于本技术领域的普通技术人员来说许多修改和变更都是显而易见的。本文中所用术语的选择,旨在最好地解释各实施例的原理、实际应用或对市场中的技术的改进,或者使本技术领域的其它普通技术人员能理解本文披露的各实施例。
工业实用性
本公开实施例公开了一种视频插帧方法、装置、电子设备、存储介质、程序和程序产品,所述方法包括:获取待处理视频对应的初始待插帧,以及初始待插帧对应的第一事件信息,第一事件信息用于表征初始待插帧中物体的运动轨迹;分别对初始待插帧以及第一事件信息进行特征提取,得到初始待插帧对应的初始帧特征图以及第一事件信息对应的事件特征图;根据初始帧特征图与事件特征图,生成目标待插帧;将目标待插帧插入至待处理视频中,得到处理后视频。本公开实施例可实现提高处理后视频的画面质量。

Claims (58)

  1. 一种视频插帧方法,包括:
    获取待处理视频对应的初始待插帧,以及所述初始待插帧对应的第一事件信息,所述第一事件信息用于表征所述初始待插帧中物体的运动轨迹;
    分别对所述初始待插帧以及所述第一事件信息进行特征提取,得到所述初始待插帧对应的初始帧特征图以及所述第一事件信息对应的事件特征图;
    根据所述初始帧特征图与所述事件特征图,生成目标待插帧;
    将所述目标待插帧插入至所述待处理视频中,得到处理后视频。
  2. 根据权利要求1所述的方法,其中,所述根据所述初始帧特征图与所述事件特征图,生成目标待插帧,包括:
    根据所述初始帧特征图与所述事件特征图,生成预估待插帧;根据所述待处理视频中、与所述初始待插帧的插帧时刻相邻的原始视频帧,以及所述原始视频帧对应的第二事件信息,对所述预估待插帧进行优化,得到所述目标待插帧,所述第二事件信息用于表征所述原始视频帧中物体的运动轨迹。
  3. 根据权利要求2所述的方法,其中,所述初始帧特征图包括S个尺度,所述事件特征图包括S个尺度,S为正整数,其中,所述根据所述初始帧特征图与所述事件特征图,生成预估待插帧,包括:
    根据第0尺度的初始帧特征图与第0尺度的事件特征图,得到第0尺度的融合特征图;根据第(s-1)尺度的融合特征图,将第s尺度的初始帧特征图与第s尺度的事件特征图进行空间对齐,得到第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图;根据所述第(s-1)尺度的融合特征图、所述第s尺度的可融合初始帧特征图以及所述第s尺度的可融合事件特征图,得到第s尺度的融合特征图;对第(S-1)尺度的融合特征图进行解码处理,得到所述预估待插帧;其中,s∈[1,S)。
  4. 根据权利要求3所述的方法,其中,所述根据第(s-1)尺度的融合特征图,将第s尺度的初始帧特征图与第s尺度的事件特征图进行空间对齐,得到第s尺度的可融合初始帧特征图以及第s尺度的可融合事件特征图,包括:
    对所述第(s-1)尺度的融合特征图进行上采样,得到上采样特征图,所述上采样特征图与所述第s尺度的初始帧特征图以及所述第s尺度的事件特征图的尺寸相同;根据所述上采样特征图与所述第s尺度的初始帧特征图之间的第一空间转换关系,得到所述第s尺度的可融合初始帧特征图;根据所述上采样特征图与所述第s尺度的事件特征图之间的第二空间转换关系,得到所述第s尺度的可融合事件特征图;其中,所述第s尺度的可融合初始帧特征图、所述第s尺度的可融合事件特征图与所述上采样特征图处于同一特征空间中。
  5. 根据权利要求4所述的方法,其中,所述第一空间转换关系是根据所述第s尺度的初始帧特征图在空间转换时的第一像素尺寸缩放信息与第一偏置信息,以及所述上采样特征图的特征信息确定的;所述第二空间转换关系是根据所述第s尺度的事件特征图在空间转换时的第二像素尺寸缩放信息与第二偏置信息,以及所述上采样特征图的特征信息确定的;其中,像素尺寸缩放信息表示空间转换中每个像素点的尺寸缩放比例,偏置信息表示空间转换中每个像素点的位置偏移量。
  6. 根据权利要求3至5任一项所述的方法,其中,所述根据所述第s-1尺度的融合特征图、所述第s尺度的可融合初始帧特征图以及所述第s尺度的可融合事件特征图,得到第s尺度的融合特征图,包括:
    对上采样特征图进行卷积处理以及非线性处理,得到所述上采样特征图对应的掩码图,其中,所述上采样特征图是对所述第(s-1)尺度的融合特征图进行上采样得到的;根据所述掩码图,将所述第s尺度的可融合初始帧特征图与所述第s尺度的可融合事件特征图进行特征融合,得到所述第s尺度的融合特征图。
  7. 根据权利要求6所述的方法,其中,所述根据所述掩码图,将所述第s尺度的可融合初始帧特征图与所述第s尺度的可融合事件特征图进行特征融合,得到所述第s尺度的融合特征图,包括:
    根据所述掩码图,将所述第s尺度的可融合初始帧特征图与所述第s尺度的可融合事件特征图进行特征融合,得到第s尺度的初始融合特征图;对所述第s尺度的初始融合特征图进行卷积处理以及非线性处理,得到所述第s尺度的融合特征图。
  8. 根据权利要求6或7所述的方法,其中,所述根据所述掩码图,将所述第s尺度的可融合初始帧特征图与所述第s尺度的可融合事件特征图进行特征融合,得到第s尺度的初始融合特征图,包括:
    计算所述掩码图与所述第s尺度的可融合事件特征图之间的哈达玛积;根据所述掩码图对应的反向掩码图,计算所述反向掩码图与所述第s尺度的可融合初始帧特征图之间的乘积;将所述哈达玛积与所述乘积相加,得到所述第s尺度的初始融合特征图。
  9. 根据权利要求3所述的方法,其中,所述根据第0尺度的初始帧特征图与第0尺度的事件特征图,得到第0尺度的融合特征图,包括:
    将所述第0尺度的初始帧特征图与所述第0尺度的事件特征图进行通道拼接,得到拼接特征图;对所述拼接特征图进行滤波处理,得到所述第0尺度的融合特征图。
  10. 根据权利要求2所述的方法,其中,所述根据所述待处理视频中、与所述初始待插帧的插帧时刻相邻的原始视频帧,以及所述原始视频帧对应的第二事件信息,对所述预估待插帧进行优化,得到所述目标待插帧,包括:
    将所述预估待插帧与所述第一事件信息进行组合,得到预估帧事件组合信息;将所述原始视频帧与所述第二事件信息进行组合,得到原始帧事件组合信息;分别对所述预估帧事件组合信息与所述原始帧事件组合信息进行特征提取,得到所述预估帧事件组合信息对应的预估帧事件特征图以及所述原始帧事件组合信息对应的原始帧事件特征图;根据所述预估帧事件特征图,对所述原始帧事件特征图进行调整,得到整合特征图;根据所述整合特征图、所述预估帧事件特征图以及融合特征图,对所述预估待插帧进行优化,得到所述目标待插帧,所述融合特征图是对所述初始帧特征图与所述事件特征图进行多尺度融合得到的。
  11. 根据权利要求10所述的方法,其中,所述预估帧事件特征图包括S *个尺度,所述原始帧事件特征图包括S *个尺度,1≤S *≤S,S *为正整数,s *∈[(S-S *),S),第(S-S *)尺度的预估帧事件特征图的尺寸为I×I,I为正整数,
    其中,所述根据所述预估帧事件特征图,对所述原始帧事件特征图进行调整,得到整合特征图,包括:
    针对第(S-S *)尺度的预估帧事件特征图中的任一个第一像素点,从第(S-S *)尺度的原始帧事件特征图中确定出与所述第一像素点匹配的第一匹配像素点;根据所述第一匹配像素点的像素位置以及指定偏移量,确定与所述像素位置对应的亚像素位置,所述指定偏移量为小数;根据I×I个所述亚像素位置,对第s *尺度的原始帧事件特征图进行调整,得到第s *尺度的整合特征图。
  12. 根据权利要求11所述的方法,其中,所述针对第(S-S *)尺度的预估帧事件特征图中的任一个第一像素点,从第(S-S *)尺度的原始帧事件特征图中确定出与所述第一像素点匹配的第一匹配像素点,包括:
    针对任一个第一像素点,计算所述第一像素点分别与所述第(S-S *)尺度的原始帧事件特征图中、在指定窗口内的各个像素点之间的特征相似度,所述指定窗口是根据所述第一像素点的像素位置确定的;将所述指定窗口内的各个像素点中的、最大特征相似度所对应的像素点,确定为所述第一匹配像素点。
  13. 根据权利要求11或12所述的方法,其中,所述根据所述第一匹配像素点的像素位置以及指定偏移量,确定与所述像素位置对应的亚像素位置,包括:
    根据所述像素位置、预设的偏移参数以及预设的曲面参数,确定目标函数,根据所述偏移参数对应的预设取值区间,对所述目标函数进行最小化求解,得到所述曲面参数的参数值,其中所述偏移参数为所述目标函数中的自变量;根据所述曲面参数的参数值,确定所述指定偏移量;将所述像素位置与所述指定偏移量相加,得到所述亚像素位置。
  14. 根据权利要求13所述的方法,其中,所述目标函数是根据曲面函数与距离函数之间的差异构建的,所述距离函数是根据所述像素位置与所述偏移参数构建的,所述曲面函数是根据所述曲面参数与所述偏移参数构建的。
  15. 根据权利要求13或14所述的方法,其中,所述曲面参数包括第一参数与第二参数,所述第一参数为2×2的矩阵,所述第二参数为2×1的向量,所述第一参数的参数值包括所述矩阵中对角线上的两个第一元素值,所述第二参数的参数值包括所述向量中的两个第二元素值,
    其中,所述根据所述曲面参数的参数值,确定所述指定偏移量,包括:根据所述两个第一元素值与所述两个第二元素值,确定纵轴偏移量与横轴偏移量,所述指定偏移量包括所述纵轴偏移量与横轴偏移量。
  16. 根据权利要求11至15任一项所述的方法,其中,所述第s *尺度的原始帧事件特征图的尺寸是所述第(S-S *)尺度的预估帧事件特征图的n倍,其中,所述根据I×I个所述亚像素位置,对第s *尺度的原始帧事件特征图进行调整,得到第s *尺度的整合特征图,包括:
    以每一个所述亚像素位置为中心,从所述第s *尺度的原始帧事件特征图上裁切出I×I个、n×n尺寸的特征图块;根据I×I个所述亚像素位置,对所述I×I个、n×n尺寸的特征图块进行尺寸拼接,得到所述第s *尺度的整合特征图,所述第s *尺度的整合特征图与所述第s *尺度的原始帧事件特征图的尺寸相同。
  17. 根据权利要求10至16任一项所述的方法,其中,所述原始视频帧包括至少两帧,第s *尺度的整合特征图包括至少两个,
    其中,所述根据所述整合特征图、所述预估帧事件特征图以及融合特征图,对所述预估待插帧进行优化,得到所述目标待插帧,包括:
    根据第s *尺度的预估帧事件特征图以及至少两个第s *尺度的整合特征图,确定第s *尺度的目标整合特征图;根据S *个尺度的目标整合特征图、所述预估帧事件特征图以及所述融合特征图,对所述预估待插帧进行优化,得到所述目标待插帧。
  18. 根据权利要求17所述的方法,其中,根据第s *尺度的预估帧事件特征图以及至少两个第s *尺度的整合特征图,确定第s *尺度的目标整合特征图,包括:
    针对所述第s *尺度的预估帧事件特征图中的任一个第二像素点,从所述至少两个第s *尺度的整合特征图中,确定出与所述第二像素点匹配的目标匹配像素点;根据各个与所述第二像素点匹配的目标匹配像素点处的特征信息,生成所述第s *尺度的目标整合特征图。
  19. 根据权利要求17或18所述的方法,其中,所述针对所述第s *尺度的预估帧事件特征图中的任一个第二像素点,从所述至少两个第s *尺度的整合特征图中,确定出与所述第二像素点匹配的目标匹配像素点,包括:
    针对任一个第s *尺度的整合特征图,根据所述第二像素点与所述第s *尺度的整合特征图中各个像素点之间的特征相似度,从所述第s *尺度的整合特征图中确定出与所述第二像素点匹配的第二匹配像素点;根据至少两个所述第二匹配像素点各自对应的特征相似度,将至少两个所述第二匹配像素点中特征相似度最大的第二匹配像素点,确定为与所述第二像素点匹配的目标匹配像素点。
  20. 根据权利要求17至19任一项所述的方法,其中,所述根据S *个尺度的目标整合特征图、所述预估帧事件特征图以及所述融合特征图,对所述预估待插帧进行优化,得到所述目标待插帧,包括:
    根据第(S-S *)尺度的目标整合特征图、第(S-S *)尺度的预估帧事件特征图以及第(S-S *)尺度的融合特征图,得到第(S-S *)尺度的目标融合特征图;对第(s *-1)尺度的目标融合特征图、第s *尺度的目标整合特征图以及第s *尺度的融合特征图进行特征融合,得到第s *尺度的目标融合特征图;提取第s *尺度的目标融合特征图中的残差特征,得到第s *尺度的残差特征图;对第S尺度的残差特征图进行解码处理,得到解码后的残差信息;将所述残差信息叠加至所述预估待插帧中,得到所述目标待插帧。
  21. 根据权利要求20所述的方法,其中,所述根据第(S-S *)尺度的目标整合特征图、第(S-S *)尺度的预估帧事件特征图以及第(S-S *)尺度的融合特征图,得到第(S-S *)尺度的目标融合特征图,包括:
    提取所述第(S-S *)尺度的预估帧事件特征图的残差特征,得到第(S-S *)尺度的残差特征图;将所述第(S-S *)尺度的残差特征图、所述第(S-S *)尺度的目标整合特征图以及所述第S-S *尺度的融合特征图进行通道拼接,得到目标拼接特征图;对所述目标拼接特征图进行滤波处理,得到所述第(S-S *)尺度的目标融合特征图。
  22. 根据权利要求1至21任一项所述的方法,其中,所述获取待处理视频对应的初始待插帧,以及所述初始待插帧对应的第一事件信息,包括:
    根据指定的插帧时刻,以及所述待处理视频中与所述插帧时刻相邻的原始视频帧,生成所述初始待插帧,所述待处理视频是事件相机采集的;根据所述事件相机在所述插帧时刻对应的时间区间内所采集的事件信号,确定所述第一事件信息,所述事件信号用于表征所述事件相机所拍摄物体上亮度发生变化的采集点、在所述时间区间内的亮度变化程度。
  23. 根据权利要求22所述的方法,其中,所述根据所述事件相机在所述插帧时刻对应的时间区间内所采集的事件信号,确定所述第一事件信息,包括:
    将所述时间区间内所采集的事件信号划分为M组事件信号,M为正整数;针对第m组事件信 号,按照预设的信号过滤区间,从所述第m组事件信号中筛除处于所述信号过滤区间外的事件信号,得到第m组目标事件信号,m∈[1,M];根据所述第m组目标事件信号中、各个目标事件信号的极性以及信号位置,将同一信号位置处的目标事件信号进行累加,得到第m个子事件信息,所述信号位置用于表征与所述目标事件信号对应的采集点、在所述事件相机的成像平面中的坐标位置;其中,所述第一事件信息包括M个子事件信息。
  24. 根据权利要求1至23任一项所述的方法,其中,所述视频插帧方法是通过图像处理网络实现的,所述图像处理网络包括互补信息融合网络与亚像素运动注意力网络,所述互补信息融合网络包括双分支特征提取子网络与多尺度自适应融合子网络;
    其中,所述分别对所述初始待插帧以及所述第一事件信息进行特征提取,得到所述初始待插帧对应的初始帧特征图以及所述第一事件信息对应的事件特征图,包括:
    通过所述双分支特征提取子网络,分别对所述初始待插帧以及所述第一事件信息进行特征提取,得到所述初始待插帧对应的初始帧特征图以及所述第一事件信息对应的事件特征图。
  25. 根据权利要求24所述的方法,其中,所述根据所述初始帧特征图与所述事件特征图,生成预估待插帧,包括:
    通过所述多尺度自适应融合子网络,根据所述初始帧特征图与所述事件特征图,生成预估待插帧;和/或,所述根据与所述初始待插帧相邻的原始视频帧以及所述原始视频帧对应的第二事件信息,对所述预估待插帧进行优化,得到所述目标待插帧,包括:通过所述亚像素运动注意力网络,根据与所述初始待插帧相邻的原始视频帧以及所述原始视频帧对应的第二事件信息,对所述预估待插帧进行优化,得到所述目标待插帧。
  26. 根据权利要求24或25所述的方法,其中,所述方法还包括:
    根据样本视频,训练初始图像处理网络,得到所述图像处理网络,所述样本视频包括样本中间帧以及与所述样本中间帧相邻的样本视频帧;
    其中,所述根据样本视频,训练初始图像处理网络,得到所述图像处理网络,包括:
    根据样本中间帧对应的中间时刻以及所述样本视频帧,生成初始中间帧;将所述样本视频帧以及所述初始中间帧输入至所述初始图像处理网络中,得到所述初始图像处理网络输出的预测中间帧;根据所述预测中间帧与所述样本中间帧之间的损失,更新所述初始图像处理网络的网络参数至所述损失满足预设条件,得到所述图像处理网络。
  27. 根据权利要求26所述的方法,其中,所述初始图像处理网络包括初始互补信息融合网络与初始亚像素运动注意力网络,所述预测中间帧包括:所述初始互补信息融合网络输出的第一预测中间帧,以及所述初始亚像素运动注意力网络输出的第二预测中间帧;
    其中,所述根据所述预测中间帧与所述样本中间帧之间的损失,更新所述初始图像处理网络的网络参数至所述损失满足预设条件,得到所述图像处理网络,包括:
    根据所述第一预测中间帧与所述样本中间帧之间的第一损失,更新所述初始互补信息融合网络的网络参数至所述第一损失收敛,得到所述互补信息融合网络;将所述互补信息融合网络输出的样本预测中间帧,输入至所述初始亚像素运动注意力网络,得到所述第二预测中间帧;根据所述第二预测中间帧与所述样本中间帧之间的第二损失,更新所述初始亚像素运动注意力网络的网络参数至所述第二损失收敛,得到所述亚像素运动注意力网络。
  28. 一种视频插帧装置,包括:
    获取模块,配置为获取待处理视频对应的初始待插帧,以及所述初始待插帧对应的第一事件信息,所述第一事件信息用于表征所述初始待插帧中物体的运动轨迹;
    特征提取模块,配置为分别对所述初始待插帧以及所述第一事件信息进行特征提取,得到所述初始待插帧对应的初始帧特征图以及所述第一事件信息对应的事件特征图;
    生成模块,配置为根据所述初始帧特征图与所述事件特征图,生成目标待插帧;
    插帧模块,配置为将所述目标待插帧插入至所述待处理视频中,得到处理后视频。
  29. 根据权利要求28所述的装置,其中,所述生成模块,包括:
    预估帧生成子模块,配置为根据所述初始帧特征图与所述事件特征图,生成预估待插帧;预估帧优化子模块,配置为根据所述待处理视频中、与所述初始待插帧的插帧时刻相邻的原始视频帧,以及所述原始视频帧对应的第二事件信息,对所述预估待插帧进行优化,得到所述目标待插帧,所述第二事件信息用于表征所述原始视频帧中物体的运动轨迹。
  30. 根据权利要求29所述的装置,其中,所述初始帧特征图包括S个尺度,所述事件特征图包括S个尺度,S为正整数,其中,所述预估帧生成子模块,包括:
    第一融合单元,配置为根据第0尺度的初始帧特征图与第0尺度的事件特征图,得到第0尺度的融合特征图;对齐单元,配置为根据第(s-1)尺度的融合特征图,将第s尺度的初始帧特征图与第s尺度的事件特征图进行空间对齐,得到第s尺度的可融合初始帧特征图与第s尺度的可融合事件特征图;第二融合单元,配置为根据所述第(s-1)尺度的融合特征图、所述第s尺度的可融合初始帧特征图以及所述第s尺度的可融合事件特征图,得到第s尺度的融合特征图;解码单元,配置为对第(S-1)尺度的融合特征图进行解码处理,得到所述预估待插帧;其中,s∈[1,S)。
  31. 根据权利要求30所述的装置,其中,所述对齐单元,包括:
    上采样子单元,配置为对所述第(s-1)尺度的融合特征图进行上采样,得到上采样特征图,所述上采样特征图与所述第s尺度的初始帧特征图以及所述第s尺度的事件特征图的尺寸相同;第一转换子单元,配置为根据所述上采样特征图与所述第s尺度的初始帧特征图之间的第一空间转换关系,得到所述第s尺度的可融合初始帧特征图;第二转换子单元,配置为根据所述上采样特征图与所述第s尺度的事件特征图之间的第二空间转换关系,得到所述第s尺度的可融合事件特征图;其中,所述第s尺度的可融合初始帧特征图、所述第s尺度的可融合事件特征图与所述上采样特征图处于同一特征空间中。
  32. 根据权利要求31所述的装置,其中,所述第一空间转换关系是根据所述第s尺度的初始帧特征图在空间转换时的第一像素尺寸缩放信息与第一偏置信息,以及所述上采样特征图的特征信息确定的;所述第二空间转换关系是根据所述第s尺度的事件特征图在空间转换时的第二像素尺寸缩放信息与第二偏置信息,以及所述上采样特征图的特征信息确定的;其中,像素尺寸缩放信息表示空间转换中每个像素点的尺寸缩放比例,偏置信息表示空间转换中每个像素点的位置偏移量。
  33. 根据权利要求30至32任一项所述的装置,其中,所述第二融合单元,包括:
    处理子单元,配置为对上采样特征图进行卷积处理以及非线性处理,得到所述上采样特征图对应的掩码图,其中,所述上采样特征图是对所述第(s-1)尺度的融合特征图进行上采样得到的;融合子单元,配置为根据所述掩码图,将所述第s尺度的可融合初始帧特征图与所述第s尺度的可融合事件特征图进行特征融合,得到所述第s尺度的融合特征图。
  34. 根据权利要求33所述的装置,其中,所述融合子单元,包括:第一融合电路,配置为根据所述掩码图,将所述第s尺度的可融合初始帧特征图与所述第s尺度的可融合事件特征图进行特征融合,得到第s尺度的初始融合特征图;处理电路,配置为对所述第s尺度的初始融合特征图进行卷积处理以及非线性处理,得到所述第s尺度的融合特征图。
  35. 根据权利要求33或34所述的装置,其中,所述第一融合电路,配置为计算所述掩码图与所述第s尺度的可融合事件特征图之间的哈达玛积;根据所述掩码图对应的反向掩码图,计算所述反向掩码图与所述第s尺度的可融合初始帧特征图之间的乘积;将所述哈达玛积与所述乘积相加,得到所述第s尺度的初始融合特征图。
  36. 根据权利要求30所述的装置,其中,所述第一融合单元,包括:拼接子单元,配置为将所述第0尺度的初始帧特征图与所述第0尺度的事件特征图进行通道拼接,得到拼接特征图;滤波子单元,配置为对所述拼接特征图进行滤波处理,得到所述第0尺度的融合特征图。
  37. 根据权利要求29所述的装置,其中,所述预估帧优化子模块,包括:
    第一组合单元,配置为将所述预估待插帧与所述第一事件信息进行组合,得到预估帧事件组合信息;第二组合单元,配置为将所述原始视频帧与所述第二事件信息进行组合,得到原始帧事件组合信息;提取单元,配置为分别对所述预估帧事件组合信息与所述原始帧事件组合信息进行特征提取,得到所述预估帧事件组合信息对应的预估帧事件特征图以及所述原始帧事件组合信息对应的原始帧事件特征图;调整单元,配置为根据所述预估帧事件特征图,对所述原始帧事件特征图进行调整,得到整合特征图;优化单元,配置为根据所述整合特征图、所述预估帧事件特征图以及融合特征图,对所述预估待插帧进行优化,得到所述目标待插帧,所述融合特征图是对所述初始帧特征图与所述事件特征图进行多尺度融合得到的。
  38. 根据权利要求37所述的装置,其中,所述预估帧事件特征图包括S *个尺度,所述原始帧事件特征图包括S *个尺度,1≤S *≤S,S *为正整数,s *∈[(S-S *),S),第(S-S *)尺度的预估帧事件特征图的尺寸为I×I,I为正整数,
    其中,所述调整单元,包括:第一确定子单元,配置为针对第(S-S *)尺度的预估帧事件特征图中的任一个第一像素点,从第(S-S *)尺度的原始帧事件特征图中确定出与所述第一像素点匹配的第一匹配像素点;第二确定子单元,配置为根据所述第一匹配像素点的像素位置以及指定偏移量,确定与所述像素位置对应的亚像素位置,所述指定偏移量为小数;调整子单元,配置为根据I×I个 所述亚像素位置,对第s *尺度的原始帧事件特征图进行调整,得到第s *尺度的整合特征图。
  39. 根据权利要求38所述的装置,其中,所述第一确定子单元,包括:
    计算电路,配置为针对任一个第一像素点,计算所述第一像素点分别与所述第(S-S *)尺度的原始帧事件特征图中、在指定窗口内的各个像素点之间的特征相似度,所述指定窗口是根据所述第一像素点的像素位置确定的;第一确定电路,配置为将所述指定窗口内的各个像素点中的、最大特征相似度所对应的像素点,确定为所述第一匹配像素点。
  40. 根据权利要求38或39所述的装置，其中，所述第二确定子单元，包括：
    第二确定电路,配置为根据所述像素位置、预设的偏移参数以及预设的曲面参数,确定目标函数,求解电路,配置为根据所述偏移参数对应的预设取值区间,对所述目标函数进行最小化求解,得到所述曲面参数的参数值,其中所述偏移参数为所述目标函数中的自变量;第三确定电路,配置为根据所述曲面参数的参数值,确定所述指定偏移量;相加电路,配置为将所述像素位置与所述指定偏移量相加,得到所述亚像素位置。
  41. 根据权利要求40所述的装置,其中,所述目标函数是根据曲面函数与距离函数之间的差异构建的,所述距离函数是根据所述像素位置与所述偏移参数构建的,所述曲面函数是根据所述曲面参数与所述偏移参数构建的。
  42. 根据权利要求40或41所述的装置,其中,所述曲面参数包括第一参数与第二参数,所述第一参数为2×2的矩阵,所述第二参数为2×1的向量,所述第一参数的参数值包括所述矩阵中对角线上的两个第一元素值,所述第二参数的参数值包括所述向量中的两个第二元素值,其中,所述第三确定电路,配置为根据所述两个第一元素值与所述两个第二元素值,确定纵轴偏移量与横轴偏移量,所述指定偏移量包括所述纵轴偏移量与横轴偏移量。
  43. 根据权利要求38至42任一项所述的装置,其中,所述第s *尺度的原始帧事件特征图的尺寸是所述第(S-S *)尺度的预估帧事件特征图的n倍,其中,所述调整子单元,包括:
    裁剪电路,配置为以每一个所述亚像素位置为中心,从所述第s *尺度的原始帧事件特征图上裁切出I×I个、n×n尺寸的特征图块;拼接电路,配置为根据I×I个所述亚像素位置,对所述I×I个、n×n尺寸的特征图块进行尺寸拼接,得到所述第s *尺度的整合特征图,所述第s *尺度的整合特征图与所述第s *尺度的原始帧事件特征图的尺寸相同。
  44. 根据权利要求37至43任一项所述的装置,其中,所述原始视频帧包括至少两帧,第s *尺度的整合特征图包括至少两个,其中,所述优化单元,包括:第三确定子单元,配置为根据第s *尺度的预估帧事件特征图以及至少两个第s *尺度的整合特征图,确定第s *尺度的目标整合特征图;优化子单元,配置为根据S *个尺度的目标整合特征图、所述预估帧事件特征图以及所述融合特征图,对所述预估待插帧进行优化,得到所述目标待插帧。
  45. 根据权利要求44所述的装置,其中,所述第三确定子单元,包括:
    第四确定电路,配置为针对所述第s *尺度的预估帧事件特征图中的任一个第二像素点,从所述至少两个第s *尺度的整合特征图中,确定出与所述第二像素点匹配的目标匹配像素点;生成电路,配置为根据各个与所述第二像素点匹配的目标匹配像素点处的特征信息,生成所述第s *尺度的目标整合特征图。
  46. 根据权利要求44或45所述的装置,其中,所述第四确定电路,配置为针对任一个第s *尺度的整合特征图,根据所述第二像素点与所述第s *尺度的整合特征图中各个像素点之间的特征相似度,从所述第s *尺度的整合特征图中确定出与所述第二像素点匹配的第二匹配像素点;根据至少两个所述第二匹配像素点各自对应的特征相似度,将至少两个所述第二匹配像素点中特征相似度最大的第二匹配像素点,确定为与所述第二像素点匹配的目标匹配像素点。
  47. 根据权利要求44至46任一项所述的装置,其中,所述优化子单元,包括:
    第二融合电路,配置为根据第(S-S *)尺度的目标整合特征图、第(S-S *)尺度的预估帧事件特征图以及第(S-S *)尺度的融合特征图,得到第(S-S *)尺度的目标融合特征图;第三融合电路,配置为对第(s *-1)尺度的目标融合特征图、第s *尺度的目标整合特征图以及第s *尺度的融合特征图进行特征融合,得到第s *尺度的目标融合特征图;提取电路,配置为提取第s *尺度的目标融合特征图中的残差特征,得到第s *尺度的残差特征图;解码电路,配置为对第S尺度的残差特征图进行解码处理,得到解码后的残差信息;叠加电路,配置为将所述残差信息叠加至所述预估待插帧中,得到所述目标待插帧。
  48. 根据权利要求47所述的装置，其中，所述第二融合电路，配置为提取所述第(S-S *)尺度的预估帧事件特征图的残差特征，得到第(S-S *)尺度的残差特征图；将所述第(S-S *)尺度的残差特征图、所述第(S-S *)尺度的目标整合特征图以及所述第S-S *尺度的融合特征图进行通道拼接，得到目标拼接特征图；对所述目标拼接特征图进行滤波处理，得到所述第(S-S *)尺度的目标融合特征图。
  49. 根据权利要求28至48任一项所述的装置,其中,所述获取模块,包括:
    初始生成子模块,配置为根据指定的插帧时刻,以及所述待处理视频中与所述插帧时刻相邻的原始视频帧,生成所述初始待插帧,所述待处理视频是事件相机采集的;事件信息生成子模块,配置为根据所述事件相机在所述插帧时刻对应的时间区间内所采集的事件信号,确定所述第一事件信息,所述事件信号用于表征所述事件相机所拍摄物体上亮度发生变化的采集点、在所述时间区间内的亮度变化程度。
  50. 根据权利要求49所述的装置,其中,所述事件信息生成子模块,包括:
    划分单元,配置为将所述时间区间内所采集的事件信号划分为M组事件信号,M为正整数;筛除单元,配置为针对第m组事件信号,按照预设的信号过滤区间,从所述第m组事件信号中筛除处于所述信号过滤区间外的事件信号,得到第m组目标事件信号,m∈[1,M];累加单元,配置为根据所述第m组目标事件信号中、各个目标事件信号的极性以及信号位置,将同一信号位置处的目标事件信号进行累加,得到第m个子事件信息,所述信号位置用于表征与所述目标事件信号对应的采集点、在所述事件相机的成像平面中的坐标位置;其中,所述第一事件信息包括M个子事件信息。
  51. 根据权利要求29至50任一项所述的装置，其中，所述视频插帧装置是通过图像处理网络实现的，所述图像处理网络包括互补信息融合网络与亚像素运动注意力网络，所述互补信息融合网络包括双分支特征提取子网络与多尺度自适应融合子网络；其中，所述特征提取模块，配置为通过所述双分支特征提取子网络，分别对所述初始待插帧以及所述第一事件信息进行特征提取，得到所述初始待插帧对应的初始帧特征图以及所述第一事件信息对应的事件特征图。
  52. 根据权利要求51所述的装置,其中,所述预估帧生成子模块,配置为通过所述多尺度自适应融合子网络,根据所述初始帧特征图与所述事件特征图,生成预估待插帧;和/或,所述预估帧优化子模块,配置为通过所述亚像素运动注意力网络,根据与所述初始待插帧相邻的原始视频帧以及所述原始视频帧对应的第二事件信息,对所述预估待插帧进行优化,得到所述目标待插帧。
  53. 根据权利要求51或52所述的装置,其中,所述装置还包括:
    网络训练模块,配置为根据样本视频,训练初始图像处理网络,得到所述图像处理网络,所述样本视频包括样本中间帧以及与所述样本中间帧相邻的样本视频帧;
    其中,所述网络训练模块,包括:中间帧生成子模块,配置为根据样本中间帧对应的中间时刻以及所述样本视频帧,生成初始中间帧;输入子模块,配置为将所述样本视频帧以及所述初始中间帧输入至所述初始图像处理网络中,得到所述初始图像处理网络输出的预测中间帧;更新子模块,配置为根据所述预测中间帧与所述样本中间帧之间的损失,更新所述初始图像处理网络的网络参数至所述损失满足预设条件,得到所述图像处理网络。
  54. 根据权利要求53所述的装置,其中,所述初始图像处理网络包括初始互补信息融合网络与初始亚像素运动注意力网络,所述预测中间帧包括:所述初始互补信息融合网络输出的第一预测中间帧,以及所述初始亚像素运动注意力网络输出的第二预测中间帧;
    其中,所述更新子模块,包括:第一更新单元,配置为根据所述第一预测中间帧与所述样本中间帧之间的第一损失,更新所述初始互补信息融合网络的网络参数至所述第一损失收敛,得到所述互补信息融合网络;输入单元,配置为将所述互补信息融合网络输出的样本预测中间帧,输入至所述初始亚像素运动注意力网络,得到所述第二预测中间帧;第二更新单元,配置为根据所述第二预测中间帧与所述样本中间帧之间的第二损失,更新所述初始亚像素运动注意力网络的网络参数至所述第二损失收敛,得到所述亚像素运动注意力网络。
  55. 一种电子设备,包括:
    处理器;用于存储处理器可执行指令的存储器;其中,所述处理器被配置为调用所述存储器存储的指令,以执行权利要求1至27中任意一项所述的方法。
  56. 一种计算机可读存储介质,其上存储有计算机程序指令,所述计算机程序指令被处理器执行时实现权利要求1至27中任意一项所述的方法。
  57. 一种计算机程序,包括计算机可读代码,在所述计算机可读代码在电子设备中运行的情况下,所述电子设备中的处理器执行如权利要求1至27任一所述的方法。
  58. 一种计算机程序产品,包括计算机可读代码,在所述计算机可读代码在电子设备中运行的情况下,所述电子设备中的处理器执行如权利要求1至27任一所述的方法。
PCT/CN2022/079310 2021-09-29 2022-03-04 视频插帧方法、装置、电子设备、存储介质、程序及程序产品 WO2023050723A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111154081.7 2021-09-29
CN202111154081.7A CN113837136B (zh) 2021-09-29 2021-09-29 视频插帧方法及装置、电子设备和存储介质

Publications (1)

Publication Number Publication Date
WO2023050723A1 true WO2023050723A1 (zh) 2023-04-06

Family

ID=78967549

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/079310 WO2023050723A1 (zh) 2021-09-29 2022-03-04 视频插帧方法、装置、电子设备、存储介质、程序及程序产品

Country Status (2)

Country Link
CN (1) CN113837136B (zh)
WO (1) WO2023050723A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721351A (zh) * 2023-07-06 2023-09-08 内蒙古电力(集团)有限责任公司内蒙古超高压供电分公司 一种架空线路通道内道路环境特征遥感智能提取方法
CN117474993A (zh) * 2023-10-27 2024-01-30 哈尔滨工程大学 水下图像特征点亚像素位置估计方法及装置
CN117765378A (zh) * 2024-02-22 2024-03-26 成都信息工程大学 多尺度特征融合的复杂环境下违禁物品检测方法和装置

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113837136B (zh) * 2021-09-29 2022-12-23 深圳市慧鲤科技有限公司 视频插帧方法及装置、电子设备和存储介质
CN114490671B (zh) * 2022-03-31 2022-07-29 北京华建云鼎科技股份公司 一种客户端同屏的数据同步系统
CN115297313B (zh) * 2022-10-09 2023-04-25 南京芯视元电子有限公司 微显示动态补偿方法及系统
CN116156250A (zh) * 2023-02-21 2023-05-23 维沃移动通信有限公司 视频处理方法及其装置
CN117315574B (zh) * 2023-09-20 2024-06-07 北京卓视智通科技有限责任公司 一种盲区轨迹补全的方法、系统、计算机设备和存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190116386A1 (en) * 2015-10-13 2019-04-18 Sony Corporation Transmission apparatus, transmission method, reception apparatus, and reception method
CN109922372A (zh) * 2019-02-26 2019-06-21 深圳市商汤科技有限公司 视频数据处理方法及装置、电子设备和存储介质
CN113034380A (zh) * 2021-02-09 2021-06-25 浙江大学 一种基于改进可变形卷积校正的视频时空超分辨率方法和装置
CN113837136A (zh) * 2021-09-29 2021-12-24 深圳市慧鲤科技有限公司 视频插帧方法及装置、电子设备和存储介质

Family Cites Families (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120030813A (ko) * 2010-09-20 2012-03-29 삼성전자주식회사 영상 데이터 처리 방법 및 이를 수행하는 표시 장치
US11568545B2 (en) * 2017-11-20 2023-01-31 A9.Com, Inc. Compressed content object and action detection
CN108830812B (zh) * 2018-06-12 2021-08-31 福建帝视信息科技有限公司 一种基于网格结构深度学习的视频高帧率重制方法
CN109379550B (zh) * 2018-09-12 2020-04-17 上海交通大学 基于卷积神经网络的视频帧率上变换方法及系统
CN111277780B (zh) * 2018-12-04 2021-07-20 阿里巴巴集团控股有限公司 一种改善插帧效果的方法和装置
CN111277895B (zh) * 2018-12-05 2022-09-27 阿里巴巴集团控股有限公司 一种视频插帧方法和装置
CN109922231A (zh) * 2019-02-01 2019-06-21 重庆爱奇艺智能科技有限公司 一种用于生成视频的插帧图像的方法和装置
CN110322525B (zh) * 2019-06-28 2023-05-02 连尚(新昌)网络科技有限公司 一种动图处理方法及终端
CN110324664B (zh) * 2019-07-11 2021-06-04 南开大学 一种基于神经网络的视频补帧方法及其模型的训练方法
CN110751021A (zh) * 2019-09-03 2020-02-04 北京迈格威科技有限公司 图像处理方法、装置、电子设备和计算机可读介质
CN110633700B (zh) * 2019-10-21 2022-03-25 深圳市商汤科技有限公司 视频处理方法及装置、电子设备和存储介质
US11430138B2 (en) * 2020-03-05 2022-08-30 Huawei Technologies Co., Ltd. Systems and methods for multi-frame video frame interpolation
CN111641835B (zh) * 2020-05-19 2023-06-02 Oppo广东移动通信有限公司 视频处理方法、视频处理装置和电子设备
KR102201297B1 (ko) * 2020-05-29 2021-01-08 연세대학교 산학협력단 다중 플로우 기반 프레임 보간 장치 및 방법
CN112771843A (zh) * 2020-06-15 2021-05-07 深圳市大疆创新科技有限公司 信息处理方法、装置和成像系统
CN112104830B (zh) * 2020-08-13 2022-09-27 北京迈格威科技有限公司 视频插帧方法、模型训练方法及对应装置
CN112584234B (zh) * 2020-12-09 2023-06-16 广州虎牙科技有限公司 视频图像的补帧方法及相关装置
CN112596843B (zh) * 2020-12-29 2023-07-25 北京元心科技有限公司 图像处理方法、装置、电子设备及计算机可读存储介质
CN112836652B (zh) * 2021-02-05 2024-04-19 浙江工业大学 一种基于事件相机的多阶段人体姿态估计方法
CN113066014B (zh) * 2021-05-19 2022-09-02 云南电网有限责任公司电力科学研究院 一种图像超分辨方法及装置

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190116386A1 (en) * 2015-10-13 2019-04-18 Sony Corporation Transmission apparatus, transmission method, reception apparatus, and reception method
CN109922372A (zh) * 2019-02-26 2019-06-21 深圳市商汤科技有限公司 视频数据处理方法及装置、电子设备和存储介质
CN113766313A (zh) * 2019-02-26 2021-12-07 深圳市商汤科技有限公司 视频数据处理方法及装置、电子设备和存储介质
CN113034380A (zh) * 2021-02-09 2021-06-25 浙江大学 一种基于改进可变形卷积校正的视频时空超分辨率方法和装置
CN113837136A (zh) * 2021-09-29 2021-12-24 深圳市慧鲤科技有限公司 视频插帧方法及装置、电子设备和存储介质

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
QI LIN $, CHEN JING , ZENG HUANQIANG , ZHU JIANQING , CAI CANHUI: " Video super-resolution method based on multi-scale feature residual learning convolutional neural network", JOURNAL OF SIGNAL PROCESSING, vol. 36, no. 1, 25 January 2020 (2020-01-25), pages 50 - 57, XP093053403 *
YU, ZHIYANG ET AL.: "Training Weakly Supervised Video Frame Interpolation with Events", 2021 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION, 17 October 2021 (2021-10-17), pages 14569 - 14578, XP034092750, DOI: 10.1109/ICCV48922.2021.01432 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116721351A (zh) * 2023-07-06 2023-09-08 内蒙古电力(集团)有限责任公司内蒙古超高压供电分公司 一种架空线路通道内道路环境特征遥感智能提取方法
CN117474993A (zh) * 2023-10-27 2024-01-30 哈尔滨工程大学 水下图像特征点亚像素位置估计方法及装置
CN117474993B (zh) * 2023-10-27 2024-05-24 哈尔滨工程大学 水下图像特征点亚像素位置估计方法及装置
CN117765378A (zh) * 2024-02-22 2024-03-26 成都信息工程大学 多尺度特征融合的复杂环境下违禁物品检测方法和装置
CN117765378B (zh) * 2024-02-22 2024-04-26 成都信息工程大学 多尺度特征融合的复杂环境下违禁物品检测方法和装置

Also Published As

Publication number Publication date
CN113837136A (zh) 2021-12-24
CN113837136B (zh) 2022-12-23

Similar Documents

Publication Publication Date Title
WO2023050723A1 (zh) 视频插帧方法、装置、电子设备、存储介质、程序及程序产品
WO2020224457A1 (zh) 图像处理方法及装置、电子设备和存储介质
TWI740309B (zh) 圖像處理方法及裝置、電子設備和電腦可讀儲存介質
CN113766313B (zh) 视频数据处理方法及装置、电子设备和存储介质
WO2021035812A1 (zh) 一种图像处理方法及装置、电子设备和存储介质
WO2021208667A1 (zh) 图像处理方法及装置、电子设备和存储介质
WO2021139120A1 (zh) 网络训练方法及装置、图像生成方法及装置
TWI702544B (zh) 圖像處理方法、電子設備和電腦可讀儲存介質
TWI736179B (zh) 圖像處理方法、電子設備和電腦可讀儲存介質
WO2021012564A1 (zh) 视频处理方法及装置、电子设备和存储介质
TW202107339A (zh) 位姿確定方法、位姿確定裝置、電子設備和電腦可讀儲存媒介
WO2020220807A1 (zh) 图像生成方法及装置、电子设备及存储介质
WO2021208666A1 (zh) 字符识别方法及装置、电子设备和存储介质
WO2020082382A1 (en) Method and system of neural network object recognition for image processing
CN109325908B (zh) 图像处理方法及装置、电子设备和存储介质
WO2023165082A1 (zh) 图像预览方法、装置、电子设备、存储介质及计算机程序及其产品
WO2021056770A1 (zh) 图像重建方法及装置、电子设备和存储介质
WO2024114475A1 (zh) 一种视频转码方法及装置、电子设备、计算机可读存储介质和计算机程序产品
WO2022247091A1 (zh) 人群定位方法及装置、电子设备和存储介质
CN109816620B (zh) 图像处理方法及装置、电子设备和存储介质
US20230098437A1 (en) Reference-Based Super-Resolution for Image and Video Enhancement
CN114581495A (zh) 图像处理方法、视频处理方法、装置以及电子设备
CN117150066B (zh) 汽车传媒领域的智能绘图方法和装置
US20240104686A1 (en) Low-Latency Video Matting
US20240071035A1 (en) Efficient flow-guided multi-frame de-fencing

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE