WO2023005140A1 - Video data processing method, apparatus, device and storage medium - Google Patents

Video data processing method, apparatus, device and storage medium

Info

Publication number
WO2023005140A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
feature
sequence
frames
frame sequence
Prior art date
Application number
PCT/CN2021/142584
Other languages
English (en)
French (fr)
Inventor
戴卫斌
王一
李伟琪
龚力
于波
周宇虹
Original Assignee
海宁奕斯伟集成电路设计有限公司
北京奕斯伟计算技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 海宁奕斯伟集成电路设计有限公司 and 北京奕斯伟计算技术有限公司
Publication of WO2023005140A1 publication Critical patent/WO2023005140A1/zh

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/2343Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
    • H04N21/234381Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
    • H04N21/440281Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping

Definitions

  • the present application relates to the field of artificial intelligence, and in particular to a video data processing method, device, equipment and storage medium.
  • frame sampling processing is often performed in a fixed manner to reduce the frame rate of the video to be processed.
  • frame interpolation processing is often used to increase the frame rate of the decoded frame sequence to improve the video quality.
  • Embodiments of the present application provide a video data processing method, apparatus, device, and storage medium.
  • the embodiment of the present application provides a video data processing method, the method comprising:
  • the above-mentioned coded data carries the reference frame number of each reference frame, and each of the above-mentioned reference frames is an image frame adjacent to the target frame sequence extracted in the above-mentioned frame extraction process;
  • the embodiment of the present application provides a video data processing method, the method comprising:
  • acquire the encoded data sent by the second device and decode the encoded data to obtain a first frame sequence, where the first frame sequence is the frame sequence after the second device performs frame extraction processing on the initial frame sequence of the video to be processed;
  • based on the reference frame numbers carried by the above-mentioned encoded data, determine each group of associated reference frames in the reference frames corresponding to the above-mentioned first frame sequence;
  • for each group of associated reference frames, based on the first reference frame and the second reference frame in the group of associated reference frames, determine the first predicted frame corresponding to the first reference frame, the second predicted frame corresponding to the second reference frame, and the occlusion weight and reconstruction residual in the frame prediction process; and based on the first predicted frame, the second predicted frame, the occlusion weight and the reconstruction residual, determine the target prediction frame corresponding to the group of associated reference frames;
  • the embodiment of the present application provides a video data processing device, the device comprising:
  • a brightness determination module, configured to determine the initial frame sequence of the video to be processed, and determine the brightness of each pixel of each image frame in the above-mentioned initial frame sequence;
  • a frame sequence determination module configured to perform frame extraction processing on the above initial frame sequence based on the brightness of each pixel of each image frame in the above initial frame sequence, and use the frame sequence after frame extraction as the first frame sequence;
  • an encoding module, configured to encode the above-mentioned first frame sequence to obtain the coded data of the above-mentioned video to be processed, where the above-mentioned coded data carries the reference frame number of each reference frame, and each of the above-mentioned reference frames is an image frame adjacent to the target frame sequence extracted in the above-mentioned frame extraction process;
  • a sending module, configured to send the encoded data to the first device, so that the first device determines a second frame sequence based on the encoded data and the reference frames corresponding to each of the reference frame numbers, and determines the playback video based on the second frame sequence.
  • the embodiment of the present application provides a video data processing device, which includes:
  • a decoding module, configured to obtain the encoded data sent by the second device, and decode the encoded data to obtain a first frame sequence, where the first frame sequence is the frame sequence obtained after the second device performs frame extraction processing on the initial frame sequence of the video to be processed;
  • a reference frame determination module, configured to determine each group of associated reference frames in the reference frames corresponding to the above-mentioned first frame sequence based on the reference frame numbers carried by the above-mentioned encoded data, where each group of the above-mentioned associated reference frames includes two reference frames adjacent to the target frame sequence;
  • a frame prediction module, configured to, for each group of the above-mentioned associated reference frames, determine, based on the first reference frame and the second reference frame in the group of associated reference frames, the first predicted frame corresponding to the above-mentioned first reference frame, the second predicted frame corresponding to the above-mentioned second reference frame, and the occlusion weight and reconstruction residual in the frame prediction process, and determine the target prediction frame corresponding to the group of associated reference frames based on the above-mentioned first predicted frame, the above-mentioned second predicted frame, the above-mentioned occlusion weight and the above-mentioned reconstruction residual;
  • the video determination module is configured to perform frame interpolation processing on the target prediction frames corresponding to each group of associated reference frames to obtain a second frame sequence, and obtain a playback video based on the above second frame sequence.
  • the embodiment of the present application provides an electronic device, including a processor and a memory, and the processor and the memory are connected to each other;
  • the above-mentioned memory is used to store computer programs
  • the above-mentioned processor is configured to execute any video data processing method of the above-mentioned first aspect and/or second aspect when invoking the above-mentioned computer program.
  • an embodiment of the present application provides a computer-readable storage medium, the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement any one of the video data processing methods of the above-mentioned first aspect and/or second aspect.
  • Fig. 1 is a network architecture diagram of the video data processing method provided by the embodiment of the present application.
  • Fig. 2 is a schematic flow chart of the video data processing method provided by the embodiment of the present application.
  • Fig. 3 is another schematic flowchart of the video data processing method provided by the embodiment of the present application.
  • Fig. 4 is a schematic diagram of a scene for determining context features provided by an embodiment of the present application.
  • Fig. 5 is a schematic structural diagram of the optical flow field estimation model provided by the embodiment of the present application.
  • FIG. 6 is a schematic diagram of a scene for determining residual features provided by the embodiment of the present application.
  • FIG. 7 is a schematic diagram of a scene for determining fusion features provided by an embodiment of the present application.
  • Fig. 8 is a schematic diagram of another scene for determining context features provided by the embodiment of the present application.
  • Fig. 9 is a schematic diagram of another scene for determining context features provided by the embodiment of the present application.
  • FIG. 10 is a schematic diagram of a scene for determining reconstruction residuals and occlusion weights provided by an embodiment of the present application.
  • Fig. 11 is a schematic diagram of another scene for determining occlusion weights and reconstruction residuals provided by the embodiment of the present application.
  • Fig. 12 is a schematic diagram of a scene for determining a target prediction frame provided by an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of a video data processing device provided by an embodiment of the present application.
  • FIG. 14 is another schematic structural diagram of a video data processing device provided by an embodiment of the present application.
  • FIG. 15 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a network architecture diagram of a video data processing method provided in an embodiment of the present application.
  • the second device 100 may determine an initial frame sequence of the video to be processed, and determine the brightness of each pixel of each image frame in the initial frame sequence. Further, the second device 100 may perform frame extraction processing on the initial frame sequence based on the brightness of each pixel of each image frame in the initial frame sequence, and use the frame sequence after frame extraction as the first frame sequence; that is, by performing frame extraction processing on the initial frame sequence, the second device 100 can reduce the frame rate of the video to be processed.
  • the second device 100 may encode the first frame sequence to obtain encoded data of the video to be processed, and then send the encoded data to the first device through a network connection.
  • the coded data sent by the second device 100 carries reference frame numbers of each reference frame, and each reference frame is an image frame adjacent to the target frame sequence extracted in the frame extraction process.
  • the first device 200 may decode the encoded data to obtain the first frame sequence, and determine each group of associated reference frames in the reference frames corresponding to the first frame sequence based on the reference frame numbers carried in the encoded data.
  • each group of associated reference frames includes two reference frames adjacent to the target frame sequence extracted in the frame extraction process.
  • the first device 200 may determine the first predicted frame corresponding to the first reference frame and the second predicted frame corresponding to the second reference frame based on the first reference frame and the second reference frame in the group of associated reference frames.
  • the first device 200 may perform frame interpolation processing on the target prediction frames corresponding to each group of associated reference frames to obtain a second frame sequence, and obtain a playback video based on the second frame sequence.
  • the above-mentioned second device 100 may be a video collection device, such as a camera device, a video generation device, etc., or a video processing device, which may be determined based on actual application scenario requirements, and is not limited here.
  • the above-mentioned first device 200 may be a video forwarding device, or a video playback device, etc., which may be determined based on actual application scenario requirements, and is not limited here.
  • FIG. 2 is a schematic flowchart of a video data processing method provided by an embodiment of the present application.
  • the video data processing method provided in the embodiment of the present application may specifically include the following steps:
  • Step S21 Determine the initial frame sequence of the video to be processed, and determine the brightness of each pixel of each image frame in the initial frame sequence.
  • the video to be processed may be a video collected by the second device in real time, a video generated by the second device based on video processing software, or a video obtained by the second device from the network, a local storage space or a cloud storage space, etc., which may be determined based on actual application scenario requirements and is not limited here.
  • an initial frame sequence of the video to be processed may be determined, and the brightness of each pixel of each image frame in the initial frame sequence may be determined.
  • the pixel value of each color channel of each pixel of the image frame can be determined, and for each pixel of the image frame, the brightness of that pixel is determined based on the pixel values of its color channels.
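  • As an illustration of this step, the sketch below computes a per-pixel brightness map from the color-channel pixel values; the 8-bit RGB layout, the ITU-R BT.601 luma weights and the helper name are assumptions for illustration, since the publication does not fix a particular weighting.

```python
import numpy as np

# Minimal sketch: per-pixel brightness from the color-channel pixel values,
# assuming 8-bit RGB frames and BT.601 luma weights (an assumption).
def pixel_brightness(frame_rgb: np.ndarray) -> np.ndarray:
    """frame_rgb: H x W x 3 uint8 array; returns an H x W brightness map."""
    r = frame_rgb[..., 0].astype(np.float32)
    g = frame_rgb[..., 1].astype(np.float32)
    b = frame_rgb[..., 2].astype(np.float32)
    # Weighted sum of the color-channel values gives the brightness of each pixel.
    return 0.299 * r + 0.587 * g + 0.114 * b
```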
  • Step S22 based on the brightness of each pixel of each image frame in the initial frame sequence, perform frame extraction processing on the initial frame sequence, and use the frame sequence after frame extraction as the first frame sequence.
  • frame extraction processing is performed on the initial frame sequence, and the frame sequence after frame extraction is used as the first frame sequence.
  • the high frame rate initial frame sequence of the video to be processed can be converted into the low frame rate first frame sequence.
  • the brightness difference between each pixel point of the image frame and the corresponding pixel point of the previous image frame can be determined, and then based on the brightness difference corresponding to each pixel point of the image frame, The total brightness difference between the image frame and the previous image frame is determined.
  • the brightness difference between a pixel of the image frame and the corresponding pixel of the previous image frame is the absolute value of the brightness difference between the two.
  • the total brightness difference Δlight between this image frame and the previous image frame is: Δlight = Σ_{i,j} |I_{i,j}(t) − I_{i,j}(t−1)|, where I_{i,j}(t) represents the brightness of the pixel point (i, j) of the t-th image frame, and I_{i,j}(t−1) represents the brightness of the pixel point (i, j) of the (t−1)-th image frame.
  • the total brightness difference between any image frame and the previous image frame is greater than the first threshold, it means that the scene corresponding to the image frame has a larger scene change than the previous image frame. Therefore, in the initial frame sequence of the video to be processed, the corresponding image frame whose total brightness difference is greater than the first threshold can be determined as the active frame.
  • the total brightness difference between any image frame and the previous image frame is less than or equal to the first threshold, it means that the scene corresponding to the image frame has less change than the scene of the previous image frame , so the image frames whose corresponding total brightness difference is less than or equal to the first threshold in the initial frame sequence of the video to be processed can be determined as still frames.
  • the image frame may be marked to distinguish the image frame as an active frame or a still frame.
  • K(t) represents the mark of the t-th image frame: if the total brightness difference Δlight of the t-th image frame compared with the previous image frame is greater than the first threshold T1, the t-th image frame is marked as an active frame; otherwise, it is marked as a still frame.
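  • The sketch below illustrates the total brightness difference and the mark K(t) defined above; the 1/0 encoding of the mark and the helper names are assumptions for illustration.

```python
import numpy as np

# Sketch of the total brightness difference and the active/still mark K(t);
# T1 is the first threshold. Encoding the mark as 1/0 is an assumption.
def total_brightness_difference(lum_t: np.ndarray, lum_prev: np.ndarray) -> float:
    """Sum over all pixels of |I_ij(t) - I_ij(t-1)| for two brightness maps."""
    return float(np.abs(lum_t - lum_prev).sum())

def mark_frame(lum_t: np.ndarray, lum_prev: np.ndarray, t1: float) -> int:
    """Returns 1 for an active frame (Δlight > T1) and 0 for a still frame."""
    return 1 if total_brightness_difference(lum_t, lum_prev) > t1 else 0
```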
  • the initial frame sequence may be subjected to frame extraction processing based on the active frame and the still frame in the initial frame sequence, so as to obtain the first frame sequence.
  • the continuous moving frame sequence and the continuous still frame sequence in the initial frame sequence may be determined, and frame extraction processing may be performed on the continuous moving frame sequence and the continuous still frame sequence in the initial frame sequence.
  • any other image frame and/or any number of continuous image frames in the continuous active frames except the first image frame and the last image frame can be used as the target frame sequence, and then the target frame sequence is extracted from the continuous active frames.
  • similarly, any other image frame and/or any number of continuous image frames in the continuous still frames except the first image frame and the last image frame can be used as the target frame sequence, and then the target frame sequence is extracted from the continuous still frames.
  • after the target frame sequence is extracted from the continuous moving frame sequence and the continuous still frame sequence based on the above method, the initial frame sequence from which the target frame sequence has been extracted can be used as the first frame sequence.
  • the initial frame rate corresponding to the video to be processed can be determined, and the initial frame sequence can be divided into at least one sub-frame sequence based on the initial frame rate of the video to be processed; the continuous moving frame sequence and the continuous still frame sequence in each sub-frame sequence are then determined, and the target frame sequence is extracted from each continuous moving frame sequence and continuous still frame sequence.
  • the target frame sequence extracted from the continuous moving frame sequence and static frame sequence in each subframe sequence is also any one of the corresponding continuous moving frame sequence or static frame sequence except the first frame and the last frame.
  • for example, the duration of the video to be processed is 10 s and its initial frame rate is 24 Hz.
  • the initial frame sequence can be divided into 10 subframe sequences with a duration of 1 s, each subframe sequence includes 24 image frames, and frame extraction processing is performed on each subframe sequence.
  • if the number of image frames in the continuous active frame sequence is greater than the second threshold, frame extraction processing is performed on the continuous active frame sequence. That is to say, any other image frame and/or any number of continuous image frames in the continuous active frames except the first image frame and the last image frame are used as the target frame sequence, and the target frame sequence is extracted from the continuous active frames.
  • similarly, if the number of image frames in the continuous still frame sequence is greater than the third threshold, frame extraction processing is performed on the continuous still frame sequence, and the target frame sequence is extracted from the continuous still frames.
  • the specific frame extraction method can be as follows:
  • T2 and T3 represent the second threshold and the third threshold respectively
  • frame numbers 3-20 correspond to continuous moving frames
  • frame numbers 26-35 correspond to continuous still frames
  • frame numbers 36-65 correspond to continuous moving frames. If the second threshold and the third threshold are both 4, then from the continuous moving frames corresponding to frame numbers 3-20, the continuous still frames corresponding to frame numbers 26-35 and the continuous moving frames corresponding to frame numbers 36-65, any image frame or a plurality of consecutive image frames other than the first frame and the last frame of each sequence can be extracted, so that the sub-frame sequence after frame extraction processing is determined as the first frame sequence.
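  • The frame extraction over continuous active and still runs described above can be sketched as follows; the 1/0 mark encoding, the helper name, and the choice to drop every interior frame of a qualifying run (one of the allowed options) are assumptions.

```python
from typing import List, Tuple

def extract_frames(marks: List[int], t2: int, t3: int) -> Tuple[List[int], List[int]]:
    """
    Sketch of the frame-extraction step for one sub-frame sequence.
    marks[i] is 1 for an active frame and 0 for a still frame (assumed encoding).
    Runs of identical marks longer than T2 (active) or T3 (still) are thinned:
    here every frame between the first and last frame of the run is dropped,
    which is one of the allowed choices ("any number of consecutive frames").
    Returns (kept_indices, extracted_indices); kept_indices form the first frame sequence.
    """
    kept, extracted = [], []
    i, n = 0, len(marks)
    while i < n:
        j = i
        while j < n and marks[j] == marks[i]:
            j += 1                              # [i, j) is a run of identical marks
        run = list(range(i, j))
        threshold = t2 if marks[i] == 1 else t3
        if len(run) > threshold:
            kept.extend([run[0], run[-1]])      # keep the boundary frames (the reference frames)
            extracted.extend(run[1:-1])         # drop the interior as the target frame sequence
        else:
            kept.extend(run)
        i = j
    return kept, extracted
```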
  • Step S23 Encoding the first frame sequence to obtain encoded data of the video to be processed, the encoded data carries the reference frame number of each reference frame.
  • after performing frame extraction processing on the initial frame sequence to obtain the first frame sequence, the first frame sequence may be encoded to obtain the encoded data of the video to be processed.
  • the encoding method used when encoding the first frame sequence includes but is not limited to H.264, H.265, AVS2, and AV1, etc., which can be determined based on actual application scenario requirements, and is not limited here.
  • the coded data of the video to be processed carries the reference frame serial number of each reference frame
  • each reference frame is an image frame adjacent to the target frame sequence extracted in the frame extraction process. That is, after extracting the target frame sequence from the initial frame sequence to obtain the first frame sequence, the two image frames adjacent to the extracted target frame sequence in the first frame sequence can be determined as reference frames, and the reference frame number of each reference frame can be determined.
  • for example, if the target frame sequence extracted from the active frame sequence consists of the image frames from frame number 4 to frame number 5, the adjacent image frames are the image frame with frame number 3 and the image frame with frame number 6, which are determined as the two reference frames.
  • the reference frame numbers of the reference frames can be encoded together with the first frame sequence, so that the coded data of the video to be processed carries the reference frame numbers.
  • the reference frame numbers and the coded data are further processed, so that the coded data of the video to be processed carries the serial numbers of the reference frames.
  • Step S24 Send the coded data to the first device, so that the first device determines the second frame sequence based on the coded data and the reference frames corresponding to the reference frame numbers, and determines the playback video based on the second frame sequence.
  • the coded data can be sent to the first device 200, so that the first device 200 can determine the second frame sequence based on the coded data and the reference frames corresponding to the reference frame numbers, and determine the playback video based on the second frame sequence.
  • the specific ways for the second device 100 to send encoded data to the first device 200 include but are not limited to content delivery network (Content Delivery Network, CDN) transmission technology, peer-to-peer (Peer-to-peer, P2P) network transmission technology and PCDN transmission technology combining CDN and P2P.
  • when the second device 100 sends the coded data to the first device 200, it also sends the reference frame number of each reference frame to the first device 200.
  • the initial frame sequence with a high frame rate can be converted into the first frame sequence with a low frame rate, so that encoding the first frame sequence greatly reduces the size of the video data, and the data traffic consumed by transmitting the encoded data is correspondingly reduced, so as to achieve the effect of saving bandwidth costs.
  • FIG. 3 is another schematic flowchart of the video data processing method provided in the embodiment of the present application.
  • the video data processing method provided in the embodiment of the present application may specifically include the following steps:
  • Step S31 Acquire the encoded data sent by the second device 100, and decode the encoded data to obtain a first frame sequence.
  • the first device 200 may decode the encoded data based on the encoding technology adopted by the second device 100 to obtain the first frame sequence.
  • the first frame sequence is the frame sequence obtained after the second device 100 performs frame extraction processing on the initial frame sequence of the video to be processed, that is, the remaining frame sequence after extracting the target frame sequence from the initial frame sequence.
  • Step S32 based on the sequence numbers of the reference frames carried in the coded data, determine each group of associated reference frames in the reference frames corresponding to the first frame sequence.
  • the frame number of each image frame in the first frame sequence obtained by the first device 200 after decoding the encoded data is the frame number of the corresponding image frame in the initial frame sequence of the video to be processed. Based on this, the first device 200 may determine the reference frames in the first frame sequence based on each reference frame number after acquiring each reference frame number.
  • the first device 200 may determine each group of associated reference frames from the reference frames corresponding to the first frame sequence, where each group of associated reference frames includes two reference frames adjacent to the target frame sequence extracted when the second device 100 performs frame extraction processing on the initial frame sequence. Furthermore, the first device 200 may determine the target prediction frame corresponding to each group of associated reference frames, and perform frame interpolation processing on the target prediction frames to obtain the second frame sequence. The specific implementation manner in which the first device 200 determines the target prediction frame corresponding to each group of associated reference frames and performs frame interpolation processing on the target prediction frames to obtain the second frame sequence is described in detail below, and will not be described here.
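  • One possible way for the first device to group associated reference frames from the carried reference frame numbers is sketched below; the assumption is that both the reference frame numbers and the decoded frame numbers follow the numbering of the initial frame sequence, and the helper name is illustrative.

```python
from typing import List, Tuple

def group_associated_reference_frames(reference_numbers: List[int],
                                      decoded_numbers: List[int]) -> List[Tuple[int, int]]:
    """
    Sketch: pair reference frames into groups of associated reference frames.
    Two reference frames form a group when the frames between them were
    extracted, i.e. no decoded frame number lies strictly between them.
    """
    present = set(decoded_numbers)
    refs = sorted(reference_numbers)
    groups = []
    for a, d in zip(refs, refs[1:]):
        gap = range(a + 1, d)
        if len(gap) > 0 and not any(k in present for k in gap):
            groups.append((a, d))   # (first reference frame, second reference frame)
    return groups
```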
  • Step S33 For each group of associated reference frames, based on the first reference frame and the second reference frame in the group of associated reference frames, determine the first predicted frame corresponding to the first reference frame, the second predicted frame corresponding to the second reference frame, and the occlusion weight and reconstruction residual in the frame prediction process; based on the first predicted frame, the second predicted frame, the occlusion weight and the reconstruction residual, determine the target prediction frame corresponding to the group of associated reference frames.
  • for each group of associated reference frames, the first predicted frame corresponding to the first reference frame and the second predicted frame corresponding to the second reference frame may be determined based on the first reference frame and the second reference frame in the group of associated reference frames.
  • the first reference frame in the group of associated reference frames is a reference frame with a smaller reference frame number
  • the second reference frame is a reference frame with a larger reference frame number.
  • Both the first predicted frame and the second predicted frame are image frames between the first reference frame and the second reference frame.
  • the first optical flow field corresponding to the first reference frame and the second optical flow field corresponding to the second reference frame can be determined.
  • the optical flow field refers to a two-dimensional instantaneous velocity field composed of all the pixels in an image, which reflects the changes of pixels in the time domain and the correlation between adjacent frames, that is, the correspondence between the previous frame and the current frame.
  • the feature extraction can be performed on the first reference frame to obtain the first initial feature, and the feature extraction can be performed on the second reference frame to obtain The second initial feature, and then based on the first initial feature and the second initial feature, the associated features corresponding to the first reference frame and the second reference frame are obtained.
  • feature extraction may be performed on each reference frame based on a neural network to obtain corresponding initial features.
  • the associated features of the first initial feature and the second initial feature can be obtained based on feature splicing, feature fusion, or further processing based on other neural network models.
  • the specific implementation manner can be determined based on actual application scenario requirements, and is not limited here.
  • a first context feature of the first reference frame may be determined, and based on the first context feature and the foregoing associated features, a first optical flow field corresponding to the first reference frame may be determined.
  • a second context feature of the second reference frame may be determined, and based on the second context feature and the aforementioned associated features, a second optical flow field corresponding to the second reference frame may be determined.
  • FIG. 4 is a schematic diagram of a scene for determining context features provided by an embodiment of the present application.
  • the contextual feature extraction network shown in Figure 4 includes multiple concatenated convolutional layers and activation function combinations.
  • the reference frame can be convolved based on the first convolution layer to obtain the first convolution feature, and the first activation function is used to process the first convolution feature to obtain a feature map of the reference frame.
  • the feature map obtained by each activation function in Figure 4 can be determined as the context feature of the reference frame.
  • the number of convolutional layers and activation functions in the feature extraction network shown in FIG. 4 can be specifically determined based on actual application scenario requirements, and is not limited here.
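  • A minimal sketch of such a context feature extraction network is given below; the channel widths, strides, activation choice and number of stages are illustrative assumptions rather than values from the publication.

```python
import torch
import torch.nn as nn

# Sketch of the context feature extraction network of Figure 4: a stack of
# convolution + activation combinations whose intermediate outputs are all
# collected as the context feature of the frame.
class ContextNetSketch(nn.Module):
    def __init__(self, in_channels: int = 3, widths=(16, 32, 64, 96, 128)):
        super().__init__()
        stages, prev = [], in_channels
        for w in widths:
            stages.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),
                nn.PReLU(w),
            ))
            prev = w
        self.stages = nn.ModuleList(stages)

    def forward(self, frame: torch.Tensor):
        feature_maps = []
        x = frame
        for stage in self.stages:
            x = stage(x)             # convolution followed by activation
            feature_maps.append(x)   # every activation output becomes one feature map
        return feature_maps          # the list of feature maps is the context feature
```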
  • the optical flow fields corresponding to the first reference frame and the second reference frame of each group of associated reference frames can be determined based on an optical flow field estimation model (Recurrent All-Pairs Field Transforms for Optical Flow, RAFT).
  • FIG. 5 is a schematic structural diagram of an optical flow field estimation model provided by an embodiment of the present application.
  • the first reference frame I a and the second reference frame I d in the group of associated reference frames are input into the feature encoding module, so that feature extraction is performed on the first reference frame I a and the second reference frame I d respectively based on the feature encoding module to obtain the first initial feature and the second initial feature, and feature association is further performed based on the first initial feature and the second initial feature to obtain the associated feature.
  • I represents a reference frame
  • a and d are position information (such as frame number, time domain position, etc.) of the reference frame respectively.
  • the first context feature C a of the first reference frame I a may be determined based on the context feature extraction network.
  • the first context feature C a can be expressed as a set of feature maps, each of which is obtained based on one convolutional layer and activation function combination of the context feature extraction network.
  • the second context feature C d of the second reference frame I d may be determined based on the context feature extraction network.
  • the second context feature C d can be expressed as a set of feature maps, each of which is obtained based on one convolutional layer and activation function combination of the context feature extraction network. The second context feature C d and the associated feature of the second reference frame I d are input into the recurrent neural network to obtain the second optical flow field F b→d of the second reference frame I d, where b is the position information of the target prediction frame corresponding to the group of associated reference frames and lies between a and d.
  • the first reference frame is backward mapped based on the first optical flow field to obtain the first predicted frame corresponding to the first reference frame, and the second reference frame is backward mapped based on the second optical flow field to obtain the second predicted frame corresponding to the second reference frame.
  • for example, based on the first optical flow field F b→a, the first reference frame I a is backward mapped to obtain the first predicted frame, and based on the second optical flow field F b→d, the second reference frame I d is backward mapped to obtain the second predicted frame.
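  • Backward mapping a reference frame with an optical flow field can be sketched as follows; implementing it with PyTorch's grid_sample and bilinear sampling is a common choice but an assumption here, as is the helper name.

```python
import torch
import torch.nn.functional as F

# Sketch of backward mapping (warping) a frame with an optical flow field.
def backward_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """frame: (N, C, H, W); flow: (N, 2, H, W) giving per-pixel (dx, dy) displacements."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h, device=frame.device),
                            torch.arange(w, device=frame.device), indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]          # sample location x + dx
    grid_y = ys.unsqueeze(0) + flow[:, 1]          # sample location y + dy
    # Normalize sampling locations to [-1, 1] as required by grid_sample.
    grid = torch.stack((2.0 * grid_x / (w - 1) - 1.0,
                        2.0 * grid_y / (h - 1) - 1.0), dim=-1)
    return F.grid_sample(frame, grid, mode="bilinear",
                         padding_mode="border", align_corners=True)
```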
  • the occlusion weight and reconstruction residual in the frame prediction process are determined.
  • the reconstruction residual is used to reduce the gradient descent problem in the frame prediction process
  • the occlusion weight is used to reduce the impact of moving objects shaking and edge blurring in the frame prediction process.
  • the third context feature of the first reference frame and the fourth context feature of the second reference frame may be determined first.
  • the third context feature of the first reference frame and the fourth context feature of the second reference frame can be determined based on the manner shown in FIG. 4 , or can be determined based on other context feature extraction networks, and specifically can be determined based on actual application scenario requirements, There is no limitation here.
  • the occlusion weight and reconstruction residual in the frame prediction process may be determined.
  • the first optical flow field, the second optical flow field, the first predicted frame, the second predicted frame, the third contextual feature, and the fourth contextual feature are input into the deep neural network to obtain the occlusion weight and reconstruction residual in the frame prediction process.
  • the above-mentioned deep neural networks include but are not limited to FusionNet and U-Net, which can be determined based on actual application scenario requirements, and are not limited here.
  • when the occlusion weight in the frame prediction process is determined based on the first optical flow field, the second optical flow field, the first predicted frame, the second predicted frame, the third context feature and the fourth context feature, the residual feature can first be determined based on the first optical flow field, the second optical flow field, the first predicted frame, and the second predicted frame.
  • FIG. 6 is a schematic diagram of a scene for determining residual features provided by an embodiment of the present application.
  • the first optical flow field, the second optical flow field, the first predicted frame, and the second predicted frame are input to the convolution layer, and after processing through the convolutional neural network and activation function, the processing results are input into the residual block to obtain the residual feature.
  • FIG. 7 is a schematic diagram of a scene for determining fusion features provided by an embodiment of the present application.
  • the residual feature is spliced with the first feature map in the third context feature and the first feature map in the fourth context feature to obtain the first splicing feature, and the first splicing feature is input into the convolution layer for down-sampling convolution processing to obtain the first convolution feature.
  • the first convolution feature is spliced with the second feature map in the third context feature and the second feature map in the fourth context feature to obtain the second splicing feature, and the second splicing feature is input into the convolution layer for down-sampling convolution processing to obtain the second convolution feature.
  • the second convolution feature is spliced with the third feature map in the third context feature and the third feature map in the fourth context feature to obtain the third splicing feature, and the third splicing feature is input into the convolution layer for down-sampling convolution processing to obtain the third convolution feature.
  • the third convolution feature is spliced with the fourth feature map in the third context feature and the fourth feature map in the fourth context feature to obtain the fourth splicing feature, and the fourth splicing feature is input into the convolution layer for down-sampling convolution processing to obtain the fourth convolution feature.
  • the fourth convolution feature is spliced with the fifth feature map in the third context feature and the fifth feature map in the fourth context feature to obtain the fifth splicing feature, and the fifth splicing feature is input into the convolution layer for up-sampling convolution processing to obtain the fifth convolution feature.
  • the fifth convolutional feature and the third convolutional feature are concatenated to obtain the sixth concatenated feature, and the sixth concatenated feature is input to the convolution layer for upsampling processing to obtain the sixth convoluted feature.
  • the sixth convolutional feature and the second convolutional feature are concatenated to obtain the seventh concatenated feature, and the seventh concatenated feature is input to the convolution layer for up-sampling processing to obtain the seventh convoluted feature.
  • the seventh convolution feature and the first convolution feature are concatenated to obtain the eighth concatenation feature, and the eighth concatenation feature is input to the convolution layer for upsampling processing to obtain the ninth convolution feature.
  • the fusion feature can be obtained by concatenating the ninth convolution feature and the residual feature.
  • the number of convolutional layers used for up-sampling processing is the same as the number of convolutional layers used for down-sampling processing, and the specific number is consistent with the number of feature maps in the third context feature or the fourth context feature.
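  • A heavily simplified sketch of this down-sampling/up-sampling fusion structure is given below; the channel widths, the resizing of feature maps before each splice, the use of Lazy modules, and the class name are assumptions made only to keep the sketch self-contained, not details from the publication.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Simplified sketch of the fusion step of Figure 7: the residual feature is
# progressively spliced with the context feature maps through down-sampling
# convolutions, then up-sampled with skip connections, and the final
# convolution feature is concatenated with the residual feature.
class FusionSketch(nn.Module):
    def __init__(self, levels: int = 5, width: int = 64):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Sequential(nn.LazyConv2d(width, 3, stride=2, padding=1), nn.PReLU(width))
            for _ in range(levels))
        self.up = nn.ModuleList(
            nn.Sequential(nn.LazyConvTranspose2d(width, 4, stride=2, padding=1), nn.PReLU(width))
            for _ in range(levels - 1))

    @staticmethod
    def _splice(x, other):
        # Resize so the two tensors can be concatenated along the channel axis.
        other = F.interpolate(other, size=x.shape[-2:], mode="bilinear", align_corners=False)
        return torch.cat((x, other), dim=1)

    def forward(self, residual_feat, ctx_a, ctx_d):
        """ctx_a / ctx_d: lists of context feature maps (e.g. fifth / sixth context features)."""
        skips, x = [], residual_feat
        for i, down in enumerate(self.down):           # down-sampling splices
            x = down(self._splice(self._splice(x, ctx_a[i]), ctx_d[i]))
            skips.append(x)
        for i, up in enumerate(self.up):               # up-sampling with skip connections
            x = up(self._splice(x, skips[-(i + 2)]))
        # Concatenate the last convolution feature with the residual feature.
        x = F.interpolate(x, size=residual_feat.shape[-2:], mode="bilinear", align_corners=False)
        return torch.cat((x, residual_feat), dim=1)    # fusion feature
```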
  • each feature map in the third context feature can be backward mapped based on the first optical flow field to obtain the fifth context feature, and each feature map in the fourth context feature can be backward mapped based on the second optical flow field to obtain the sixth context feature.
  • the fusion feature is determined based on the fifth context feature, the sixth context feature, and the residual feature. The specific determination method is the same as that shown in FIG. 7 , and will not be described here again.
  • FIG. 8 is a schematic diagram of another scene for determining context features provided by an embodiment of the present application.
  • each feature map of the first reference frame is determined based on each convolutional layer and activation function combination, and the third context feature is obtained.
  • each feature map in the third context feature can be backward mapped based on the first optical flow field to obtain the mapped feature map corresponding to each feature map, and then the mapped feature maps are determined as the fifth context feature.
  • for each feature map in the third context feature, the optical flow field weight corresponding to the feature map can be determined, so as to determine a new optical flow field corresponding to the feature map for backward mapping based on the optical flow field weight and the first optical flow field. Furthermore, for each feature map in the third context feature, the feature map can be backward mapped based on the new optical flow field corresponding to the feature map to obtain the mapped feature map corresponding to the feature map. Further, the fifth context feature is determined based on the mapped feature maps corresponding to the feature maps in the third context feature.
  • FIG. 9 is a schematic diagram of another scene for determining context features provided by an embodiment of the present application.
  • the optical flow field weight corresponding to each feature map in the third context feature can be determined, such as 1, 0.5, 0.25, 0.125, 0.0625, etc., and then the new optical flow field corresponding to each feature map, such as optical flow field 1, optical flow field 2, optical flow field 3, optical flow field 4, and optical flow field 5, is determined based on the first optical flow field and each optical flow field weight.
  • each feature map is backward mapped based on the new optical flow field corresponding to each feature map to obtain a mapped feature map corresponding to each feature map, and each mapped feature map is determined as a fifth context feature.
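  • The per-feature-map flow weighting and backward mapping can be sketched as follows, reusing the backward_warp helper sketched earlier; resizing the flow with bilinear interpolation and the default weight values are assumptions.

```python
import torch.nn.functional as F

# Sketch: warp each context feature map with a per-level weighted optical flow
# (weights such as 1, 0.5, 0.25, ...). Reuses backward_warp from the earlier sketch.
def warp_context_features(feature_maps, flow, weights=(1.0, 0.5, 0.25, 0.125, 0.0625)):
    """feature_maps: list of (N, C_k, H_k, W_k) tensors; flow: (N, 2, H, W)."""
    warped = []
    for fmap, w in zip(feature_maps, weights):
        # New optical flow for this feature map: the first (or second) optical
        # flow field, resized to the map's resolution and scaled by the weight.
        level_flow = F.interpolate(flow, size=fmap.shape[-2:], mode="bilinear",
                                   align_corners=False) * w
        warped.append(backward_warp(fmap, level_flow))
    return warped   # the list of mapped feature maps forms the fifth/sixth context feature
```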
  • for each feature map in the fourth context feature, the optical flow field weight corresponding to the feature map can be determined, so as to determine the new optical flow field used for backward mapping of the feature map based on the optical flow field weight and the second optical flow field.
  • the corresponding new optical flow field can be determined for each feature map in the fourth context feature.
  • the feature map can be backward mapped based on the new optical flow field corresponding to the feature map to obtain the mapped feature map corresponding to the feature map.
  • the sixth context feature is determined.
  • optical flow field weights corresponding to the feature maps in the third context feature and the fourth context feature may be specifically determined based on actual application scenarios, and there is no limitation here.
  • FIG. 10 is a schematic diagram of a scene for determining a reconstruction residual and an occlusion weight provided by an embodiment of the present application.
  • the fusion features can be input to the convolutional layer for further processing on the fusion features, and the processing results are subjected to sub-pixel convolution to obtain high-resolution target features.
  • the occlusion weight and reconstruction residual in the frame prediction process are determined based on the target features.
  • the number of channels corresponding to the target feature and the feature value corresponding to each channel can be determined. Then, the feature value of the last channel is determined as the occlusion weight in the frame prediction process, and the reconstruction residual in the frame prediction process is determined based on the feature values of the other channels. For example, the feature values corresponding to the channels other than the last channel are spliced to obtain the reconstruction residual in the frame prediction process.
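  • A sketch of this head is shown below: further convolution of the fusion feature, sub-pixel convolution (pixel shuffle), then a channel split into occlusion weight and reconstruction residual; the channel counts, the sigmoid on the occlusion weight and the class name are assumptions.

```python
import torch
import torch.nn as nn

# Sketch of the head of Figure 10: convolution + sub-pixel convolution to a
# high-resolution target feature, then a channel split.
class PredictionHead(nn.Module):
    def __init__(self, out_channels: int = 4, upscale: int = 2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.LazyConv2d(out_channels * upscale ** 2, 3, padding=1),
            nn.PixelShuffle(upscale),     # sub-pixel convolution to high resolution
        )

    def forward(self, fusion_feature: torch.Tensor):
        target = self.conv(fusion_feature)           # high-resolution target feature
        occlusion = torch.sigmoid(target[:, -1:])    # last channel -> occlusion weight
        residual = target[:, :-1]                    # other channels -> reconstruction residual
        return occlusion, residual
```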
  • FIG. 11 is a schematic diagram of another scene for determining occlusion weights and reconstruction residuals provided by an embodiment of the present application. That is, by the method shown in Figure 6, the residual feature is determined based on the first optical flow field, the second optical flow field, the first predicted frame and the second predicted frame; by the method shown in Figure 7, the fusion feature is determined based on the residual feature and each feature map in the third context feature and the fourth context feature; furthermore, through the method shown in FIG. 10, the reconstruction residual and the occlusion weight in the frame prediction process are determined based on the fusion feature.
  • furthermore, the target prediction frame corresponding to the group of associated reference frames can be determined based on the first predicted frame, the second predicted frame, the occlusion weight and the reconstruction residual.
  • the specific determination method can be as follows:
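  • The specific formula is not reproduced in this text; the sketch below shows one common way of combining the two predicted frames with an occlusion weight and a reconstruction residual, and is offered only as an assumption rather than the formulation of the publication.

```python
import torch

# Assumed combination: occlusion-weighted blend of the two predicted frames
# plus the reconstruction residual, with pixel values assumed in [0, 1].
def target_predicted_frame(pred_a: torch.Tensor, pred_d: torch.Tensor,
                           occlusion: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    """pred_a / pred_d: first and second predicted frames; occlusion in [0, 1]."""
    blended = occlusion * pred_a + (1.0 - occlusion) * pred_d   # occlusion-weighted blend
    return torch.clamp(blended + residual, 0.0, 1.0)            # add residual, clip to range
```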
  • Step S34 Perform frame interpolation processing on the target prediction frames corresponding to each group of associated reference frames to obtain a second frame sequence, and obtain a playback video based on the second frame sequence.
  • the target prediction frame corresponding to the associated reference frame is the target prediction frame between the first reference frame and the second reference frame of the group of associated reference frames. Based on this, the target prediction frame corresponding to each group of associated reference frames can be subjected to frame interpolation processing, and the target prediction frame corresponding to each associated reference frame is inserted between the first reference frame and the second reference frame of the associated reference frame, so that The second frame sequence is obtained on the basis of the first frame sequence.
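  • The frame interpolation step can be sketched as follows; the dictionary-based data layout and the helper name are assumptions for illustration.

```python
from typing import Dict, List, Tuple

def build_second_frame_sequence(first_sequence: Dict[int, object],
                                predicted: Dict[Tuple[int, int], List[object]]) -> List[object]:
    """
    Sketch of the frame interpolation step: insert the target prediction frames
    of each group of associated reference frames between its first and second
    reference frame. first_sequence maps frame numbers to decoded frames;
    predicted maps (first_ref_no, second_ref_no) to the frames predicted for that gap.
    """
    numbers = sorted(first_sequence)
    out = []
    for a, d in zip(numbers, numbers[1:]):
        out.append(first_sequence[a])
        out.extend(predicted.get((a, d), []))   # frames predicted for the extracted gap
    out.append(first_sequence[numbers[-1]])
    return out                                  # the second frame sequence
```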
  • the first device may determine to play the video based on the second frame sequence, that is, the second frame sequence is a frame sequence corresponding to the video played by the first device.
  • FIG. 12 is a schematic diagram of a scene for determining a target prediction frame provided by an embodiment of the present application.
  • the first optical flow field corresponding to the first reference frame and the second optical flow field corresponding to the second reference frame are determined through the RAFT model; the first predicted frame is obtained by performing backward mapping on the first reference frame based on the first optical flow field, and the second predicted frame is obtained by performing backward mapping on the second reference frame based on the second optical flow field.
  • the third context feature corresponding to the first reference frame and the fourth context feature corresponding to the second reference frame are respectively determined through the context feature extraction network (ContextNet); the fifth context feature is obtained by performing backward mapping on each feature map in the third context feature based on the first optical flow field, and the sixth context feature is obtained by performing backward mapping on each feature map in the fourth context feature based on the second optical flow field.
  • determining the target prediction frame based on the first predicted frame, the second predicted frame, the first optical flow field and the second optical flow field, as well as the occlusion weight and reconstruction residual in the frame prediction process, can fully take into account the occlusion information in the frame prediction process, the detail information of each image frame and the optical flow field information, which can effectively alleviate the problems of object shaking and edge blurring in the frame prediction process, thereby improving video clarity and enhancing the video viewing experience.
  • FIG. 13 is a schematic structural diagram of a video data processing device provided by an embodiment of the present application.
  • the video data processing device provided in the embodiment of the present application includes:
  • the brightness determination module 41 is used to determine the initial frame sequence of the video to be processed, and determine the brightness of each pixel of each image frame in the above-mentioned initial frame sequence;
  • the frame sequence determination module 42 is used to perform frame extraction processing on the above-mentioned initial frame sequence based on the brightness of each pixel point of each image frame in the above-mentioned initial frame sequence, and use the frame sequence after frame extraction as the first frame sequence;
  • the encoding module 43 is configured to encode the above-mentioned first frame sequence to obtain the coded data of the above-mentioned video to be processed, where the above-mentioned coded data carries the reference frame number of each reference frame, and each of the above-mentioned reference frames is an image frame adjacent to the target frame sequence extracted in the above-mentioned frame extraction process;
  • the sending module 44 is configured to send the encoded data to the first device, so that the first device determines the second frame sequence based on the encoded data and the reference frames corresponding to the reference frame numbers, and determines the playback video based on the second frame sequence.
  • the frame sequence determination module 42 is configured to:
  • an image frame whose corresponding total brightness difference is greater than a first threshold is determined as an active frame, and an image frame whose corresponding total brightness difference is less than or equal to the above first threshold is determined as a still frame;
  • frame extraction processing is performed on the above-mentioned initial frame sequence.
  • the frame sequence determination module 42 is configured to:
  • Frame extraction processing is performed on the continuous moving frame sequence and the continuous still frame sequence in the above initial frame sequence.
  • the frame sequence determination module 42 is configured to:
  • the continuous still frame sequence is subjected to frame extraction processing.
  • the above video data processing device can implement the implementation manners provided by the above steps in FIG. 2 through its built-in functional modules.
  • for details, please refer to the implementation manners provided by the above steps, which will not be repeated here.
  • FIG. 14 is another schematic structural diagram of a video data processing device provided by an embodiment of the present application.
  • the video data processing device provided in the embodiment of the present application includes:
  • the decoding module 51 is configured to obtain the encoded data sent by the second device, and decode the encoded data to obtain a first frame sequence.
  • the first frame sequence is the frame sequence obtained after the second device performs frame extraction processing on the initial frame sequence of the video to be processed;
  • the reference frame determination module 52 is configured to determine each group of associated reference frames in the reference frames corresponding to the above-mentioned first frame sequence based on the reference frame numbers carried by the above-mentioned encoded data, where each group of the above-mentioned associated reference frames includes two reference frames adjacent to the extracted target frame sequence;
  • the frame prediction module 53 is configured to, for each group of the above-mentioned associated reference frames, determine, based on the first reference frame and the second reference frame in the group of associated reference frames, the first predicted frame corresponding to the above-mentioned first reference frame, the second predicted frame corresponding to the above-mentioned second reference frame, and the occlusion weight and reconstruction residual in the frame prediction process, and determine the target prediction frame corresponding to the group of associated reference frames based on the above-mentioned first predicted frame, the above-mentioned second predicted frame, the above-mentioned occlusion weight and the above-mentioned reconstruction residual;
  • the video determination module 54 is configured to perform frame interpolation processing on the target prediction frames corresponding to each group of associated reference frames to obtain a second frame sequence, and obtain a playback video based on the above second frame sequence.
  • the above-mentioned frame prediction module 53 is configured to:
  • the above-mentioned frame prediction module 53 is configured to:
  • the above-mentioned frame prediction module 53 is configured to:
  • based on the above-mentioned first optical flow field and the above-mentioned second optical flow field, determine the occlusion weight and reconstruction residual in the frame prediction process.
  • the above-mentioned frame prediction module 53 is used to:
  • the occlusion weights and reconstruction residuals in the frame prediction process are determined based on the above fused features.
  • the above-mentioned third context feature and the above-mentioned fourth context feature include multiple feature maps; the above-mentioned frame prediction module 53 is configured to:
  • determine the mapping feature map corresponding to each of the above feature maps in the above third context feature as the fifth context feature of the first reference frame;
  • determine the mapping feature map corresponding to each of the above feature maps in the above fourth context feature as the sixth context feature of the second reference frame;
  • a fusion feature is determined based on the fifth context feature, the sixth context feature, and the residual feature.
  • the above-mentioned frame prediction module 53 is used to:
  • the reconstruction residuals are determined.
  • the above video data processing device can implement the implementation methods provided by the above steps in FIG. 3 through various built-in functional modules.
  • for details, please refer to the implementation manners provided by the above steps, which will not be repeated here.
  • FIG. 15 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device 1000 in this embodiment may include: a processor 1001, a network interface 1004, and a memory 1005.
  • the above-mentioned electronic device 1000 may also include: a user interface 1003, and at least one communication bus 1002.
  • the communication bus 1002 is used to realize connection and communication between these components.
  • the user interface 1003 may include a display screen (Display) and a keyboard (Keyboard), and the optional user interface 1003 may also include a standard wired interface and a wireless interface.
  • the network interface 1004 may include a standard wired interface and a wireless interface (such as a WI-FI interface).
  • the memory 1005 may be a high-speed RAM memory, or a non-volatile memory, such as at least one disk memory.
  • the memory 1005 may also be at least one storage device located away from the aforementioned processor 1001 .
  • the memory 1005 as a computer-readable storage medium may include an operating system, a network communication module, a user interface module, and a device control application program.
  • the network interface 1004 can provide a network communication function; the user interface 1003 is mainly used to provide an input interface for the user; and the processor 1001 can be used to call the device control application program stored in the memory 1005 to realize the video data processing method executed by the first device and/or the second device.
  • the above-mentioned processor 1001 may be a central processing unit (central processing unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (digital signal processor, DSP) , application specific integrated circuit (ASIC), off-the-shelf programmable gate array (field-programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory, which can include read-only memory and random access memory, provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
  • the above-mentioned electronic device 1000 can execute the implementation methods provided by the above-mentioned steps in FIG. 2 and/or FIG. 3 through its built-in functional modules.
  • the embodiment of the present application also provides a computer-readable storage medium; the computer-readable storage medium stores a computer program, and the computer program is executed by a processor to implement the method provided by each step in FIG. 2 and/or FIG. 3.
  • the foregoing computer-readable storage medium may be an internal storage unit of the video data processing apparatus and/or the electronic device provided in any of the foregoing embodiments, such as a hard disk or memory of the electronic device.
  • the computer-readable storage medium may also be an external storage device of the electronic device, such as a plug-in hard disk equipped on the electronic device, a smart media card (SMC), a secure digital (SD) card, a flash card, etc.
  • the above-mentioned computer-readable storage medium may also include a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), etc.
  • the computer-readable storage medium may also include both an internal storage unit of the electronic device and an external storage device.
  • the computer-readable storage medium is used to store the computer program and other programs and data required by the electronic device.
  • the computer-readable storage medium can also be used to temporarily store data that has been output or will be output.
  • An embodiment of the present application provides a computer program product or computer program, where the computer program product or computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium.
  • the processor of the electronic device reads the computer instruction from the computer-readable storage medium, and the processor executes the computer instruction, so that the computer device executes the method provided by each step in FIG. 2 and/or FIG. 3 .

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The embodiments of this application disclose a video data processing method, apparatus, device, and storage medium. The method includes: determining an initial frame sequence of a video to be processed, and determining the luminance of each pixel of each image frame in the initial frame sequence; performing frame extraction on the initial frame sequence based on the luminance of each pixel of each image frame in the initial frame sequence, and taking the frame sequence after extraction as a first frame sequence; encoding the first frame sequence to obtain encoded data of the video to be processed, the encoded data carrying the reference frame numbers of the reference frames, each reference frame being an image frame adjacent to a target frame sequence extracted during the frame extraction; and sending the encoded data to a first device, so that the first device determines a second frame sequence based on the encoded data and the reference frames corresponding to the reference frame numbers, and determines a play-out video based on the second frame sequence.

Description

视频数据处理方法、装置、设备以及存储介质
相关申请的交叉引用
本申请要求于2021年7月30日向中国国家知识产权局递交的中国专利申请No.202110874693.7的优先权,其全部公开内容通过引用并入本文。
技术领域
本申请涉及人工智能领域,尤其涉及一种视频数据处理方法、装置、设备以及存储介质。
背景技术
随着互联网技术的高速发展,短视频、在线直播等多媒体数据激增,不断增长的视频用户对视觉效果的要求越来越高,对于视频数据处理方法,如何减少带宽消耗的同时,保证视频的视觉效果提出了挑战。
目前在对待处理视频进行编码时,往往采用固定的方式进行抽帧处理以降低待处理视频的帧率。在对待处理视频的编码数据进行解码后,往往还会采用插帧处理以提升解码得到的帧序列的帧率,以提升视频质量。
发明内容
本申请实施例提供一种视频数据处理方法、装置、设备以及存储介质。
第一方面,本申请实施例提供一种视频数据处理方法,该方法包括:
确定待处理视频的初始帧序列,确定上述初始帧序列中各图像帧的各像素点的亮度;
基于上述初始帧序列中各图像帧的各像素点的亮度,对上述初始帧序列进行抽帧处理,将抽帧后的帧序列作为第一帧序列;
对上述第一帧序列进行编码,得到上述待处理视频的编码数据,上述编码数据携带各参考帧的参考帧序号,各上述参考帧为与上述抽帧处理中抽取出的目标帧序列相邻的图像帧;
向第一设备发送上述编码数据,以使上述第一设备基于上述编码数据和各上述参考帧序号对应的参考帧,确定第二帧序列,并基于上述第二帧序列确定播放视频。
第二方面,本申请实施例提供了一种视频数据处理方法,该方法包括:
获取第二设备发送的编码数据,对上述编码数据进行解码得到第一帧序列,上述第一帧序列是上述第二设备对待处理视频的初始帧序列进行抽帧处理后的帧序列;
基于上述编码数据携带的各参考帧序号,确定上述第一帧序列对应的参考帧中的各组关联参考帧,每组上述关联参考帧包括与上述抽帧处理中抽取出的目标帧序列相邻的两个参考帧;
对于每组上述关联参考帧,基于该组关联参考帧中的第一参考帧和第二参考帧,确定上述第一参考帧对应的第一预测帧和上述第二参考帧对应的第二预测帧、以及帧预测过程中的遮挡权重和重建残差,基于上述第一预测帧、上述第二预测帧、上述遮挡权重和上述重建残差,确定该组关联参考帧对应的目标预测帧;
将各组上述关联参考帧对应的目标预测帧进行插帧处理,得到第二帧序列,基于上述第二帧序列得到播放视频。
第三方面,本申请实施例提供了一种视频数据处理装置,该装置包括:
亮度确定模块,用于确定待处理视频的初始帧序列,确定上述初始帧序列中各图像帧的各像素点的亮度;
帧序列确定模块,用于基于上述初始帧序列中各图像帧的各像素点的亮度,对上述初始帧序列进行抽帧处理,将抽帧后的帧序列作为第一帧序列;
编码模块,用于对上述第一帧序列进行编码,得到上述待处理视频的编码数据,上述编码数据携带各参考帧的参考帧序号,各上述参考帧为与上述抽帧处理中抽取出的目标帧序列相邻的图像帧;
发送模块,用于向第一设备发送上述编码数据,以使上述第一设备基于上述编码数据和各上述参考帧序号对应的参考帧,确定第二帧序列,并基于上述第二帧序列确定播放视频。
第四方面,本申请实施例提供了一种视频数据处理装置,该装置包括:
解码模块，用于获取第二设备发送的编码数据，对上述编码数据进行解码得到第一帧序列，上述第一帧序列是上述第二设备对待处理视频的初始帧序列进行抽帧处理后的帧序列；
参考帧确定模块,用于基于上述编码数据携带的各参考帧序号,确定上述第一帧序列对应的参考帧中的各组关联参考帧,每组上述关联参考帧包括与上述抽帧处理中抽取出的目标帧序列相邻的两个参考帧;
帧预测模块,用于对于每组上述关联参考帧,基于该组关联参考帧中的第一参考帧和第二参考帧,确定上述第一参考帧对应的第一预测帧和上述第二参考帧对应的第二预测帧、以及帧预测过程中的遮挡权重和重建残差,基于上述第一预测帧、上述第二预测帧、上述遮挡权重和上述重建残差,确定该组关联参考帧对应的目标预测帧;
视频确定模块,用于将各组上述关联参考帧对应的目标预测帧进行插帧处理,得到第二帧序列,基于上述第二帧序列得到播放视频。
第五方面,本申请实施例提供了一种电子设备,包括处理器和存储器,该处理器和存储器相互连接;
上述存储器用于存储计算机程序;
上述处理器被配置用于在调用上述计算机程序时,执行上述第一方面和/或第二方面任一种视频数据处理方法。
第六方面,本申请实施例提供了一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,该计算机程序被处理器执行以实现上述第一方面和/或第二方面任一种视频数据处理方法。
附图说明
为了更清楚地说明本申请实施例中的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的视频数据处理方法的网络架构图;
图2是本申请实施例提供的视频数据处理方法的一流程示意图;
图3是本申请实施例提供的视频数据处理方法的另一流程示意图;
图4是本申请实施例提供的确定上下文特征的一场景示意图;
图5是本申请实施例提供的光流场估计模型的一结构示意图;
图6是本申请实施例提供的确定残差特征的一场景示意图;
图7是本申请实施例提供的确定融合特征的一场景示意图;
图8是本申请实施例提供的确定上下文特征的另一场景示意图;
图9是本申请实施例提供的确定上下文特征的又一场景示意图;
图10是本申请实施例提供的确定重建残差和遮挡权重的场景示意图;
图11是本申请实施例提供的确定遮挡权重和重建残差的另一场景示意图;
图12是本申请实施例提供的确定目标预测帧的一场景示意图;
图13是本申请实施例提供的视频数据处理装置的一结构示意图;
图14是本申请实施例提供的视频数据处理装置的另一结构示意图;
图15是本申请实施例提供的电子设备的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
参见图1,图1是本申请实施例提供的视频数据处理方法的网络架构图。如图1所示,第二设备100在获取到待处理视频后,可确定待处理视频的初始帧序列,并确定初始帧序列中各图像帧的各像素点的亮度。进一步地,第二设备100可基于初始帧序列中各图像帧的各像素点的亮度,对初始帧序列进行抽帧处理,将抽帧后的帧序列作为第一帧序列,即第二设备100通过对初始帧序列进行抽帧处理,可降低待处理视频的帧率。
第二设备100在得到第一帧序列之后,可对第一帧序列进行编码,得到待处理视频的编码数据,进而通过网络连接将编码数据发送至第一设备。其中,第二设备100发送的编码数据携带有各参考帧的参考帧序号,各参考帧为与抽帧处理过程中抽取出的目标帧序列相邻的图像帧。
对应的，第一设备200在获取到第二设备100发送的编码数据之后，可对编码数据进行解码得到第一帧序列，并基于编码数据携带的各参考帧序号，确定第一帧序列对应的参考帧中的各组关联参考帧。其中，每组关联参考帧包括与抽帧处理中抽取出的目标帧序列相邻的两个参考帧。
对于每一组关联参考帧,第一设备200可基于该组关联参考帧中的第一参考帧和第二参考帧,确定第一参考帧对应的第一预测帧和第二参考帧对应的第二预测帧、以及帧预测过程中的遮挡权重和重建残差,基于第一预测帧、第二预测帧、遮挡权重和重建残差,确定该组关联参考帧对应的目标预测帧;
进一步地，第一设备200可将各组关联参考帧对应的目标预测帧进行插帧处理，得到第二帧序列，基于第二帧序列得到播放视频。
其中,上述第二设备100可以为视频采集设备,如摄像设备、视频生成设备等,也可以为视频处理设备,具体可基于实际应用场景需求确定,在此不做限制。
其中，上述第一设备200可以为视频转发设备，也可以为视频播放设备等，具体也可基于实际应用场景需求确定，在此不做限制。
参见图2,图2是本申请实施例提供的视频数据处理方法的一流程示意图。本申请实施例提供的视频数据处理方法在应用于第二设备时,可具体包括如下步骤:
步骤S21、确定待处理视频的初始帧序列,确定初始帧序列中各图像帧的各像素点的亮度。
在一些可行的实施方式中,待处理视频可以为第二设备实时采集到的视频,也可以为第二设备基于视频处理软件所生成的视频,还可以为第二设备从网络、本地存储空间或者云存储空间等获取到的视频,具体可基于实际应用场景需求确定,在此不做限制。
在一些可行的实施方式中,对于待处理视频,可确定待处理视频的初始帧序列,并确定初始帧序列中各图像帧的各像素点的亮度。具体地,对于初始帧序列中的任一图像帧,可确定该图像帧的每一像素点的各颜色通道的像素值,对于该图像帧的每一像素点,可基于该像素点的各颜色通道的像素值确定该像素点的亮度。
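A minimal sketch of the per-pixel luminance computation described above. The description only states that luminance is derived from each pixel's colour-channel values; the BT.601 luma weights and the function name `frame_luminance` are illustrative assumptions.

```python
import numpy as np

def frame_luminance(frame_rgb: np.ndarray) -> np.ndarray:
    """Per-pixel luminance of an H x W x 3 RGB frame.

    BT.601 luma weights are an assumption; the description only requires
    that luminance be derived from each pixel's colour-channel values.
    """
    r, g, b = frame_rgb[..., 0], frame_rgb[..., 1], frame_rgb[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b
```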
步骤S22、基于初始帧序列中各图像帧的各像素点的亮度,对初始帧序列进行抽帧处理,将抽帧后的帧序列作为第一帧序列。
在一些可行的实施方式中，基于初始帧序列中各图像帧的各像素点的亮度，对初始帧序列进行抽帧处理，将抽帧后的帧序列作为第一帧序列。通过对初始帧序列进行抽帧处理，可将待处理视频的高帧率的初始帧序列转换为低帧率的第一帧序列。
具体地,对于初始帧序列中的任一图像帧,可确定该图像帧的各像素点与前一图像帧的对应像素点的亮度差,进而基于该图像帧在各像素点对应的亮度差,确定该图像帧与前一图像帧的总亮度差。
其中,对于初始帧序列中的任一图像帧,该图像帧的一个像素点与前一图像帧的对应像素点的亮度差,为二者亮度差的绝对值。
例如，对于初始帧序列中的任一图像帧，该图像帧与前一图像帧的总亮度差 $\Delta_{light}$ 为：
$$\Delta_{light}=\sum_{i,j}\left|I_{i,j}(t)-I_{i,j}(t-1)\right|$$
其中，i和j用于表示像素点的位置，$I_{i,j}(t)$ 表示第t帧图像帧的像素点(i,j)的亮度，$I_{i,j}(t-1)$ 表示第t-1帧图像帧的像素点(i,j)的亮度，$\left|I_{i,j}(t)-I_{i,j}(t-1)\right|$ 表示第t帧图像帧和第t-1帧图像帧在像素点(i,j)的亮度差的绝对值。
进一步地,若待处理视频的初始帧序列中,任一图像帧与前一图像帧的总亮度差大于第一阈值,则说明该图像帧对应的场景相较于前一图像帧的场景变化较大,因此可将待处理视频的初始帧序列中,对应的总亮度差大于第一阈值的图像帧确定为活动帧。若待处理视频的初始帧序列中,任一图像帧与前一图像帧的总亮度差小于或者等于第一阈值,则说明该图像帧对应的场景相较于前一图像帧的场景变化较小,因此可将待处理视频的初始帧序列中,对应的总亮度差小于或者等于第一阈值的图像帧确定为静止帧。
作为一示例,对于初始帧序列的任一图像帧,可对该图像帧进行标记以区分该图像帧为活动帧或者静止帧。
$$K(t)=\begin{cases}1, & \Delta_{light}>T_1\\[2pt]0, & \Delta_{light}\le T_1\end{cases}$$
如上式所示，K(t)表示第t帧的标记，若第t帧图像帧相较于前一帧图像帧的总亮度差 $\Delta_{light}$ 大于第一阈值 $T_1$，则对第t帧图像帧标记为1，以表示第t帧图像帧为活动帧。若第t帧图像帧相较于前一帧图像帧的总亮度差 $\Delta_{light}$ 不大于第一阈值 $T_1$，则对第t帧图像帧标记为0，以表示第t帧图像帧为静止帧。
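A minimal sketch of the two formulas above: sum the absolute per-pixel luminance differences against the previous frame and compare with T1 to obtain the activity mark K(t). Treating the very first frame (which has no predecessor) as static is an illustrative assumption.

```python
import numpy as np

def mark_frames(luma_frames, t1):
    """Label each frame as active (1) or static (0) following the K(t) rule.

    luma_frames: list of per-frame H x W luminance maps.
    The first frame has no predecessor; marking it 0 here is an assumption.
    """
    marks = [0]
    for prev, cur in zip(luma_frames, luma_frames[1:]):
        delta_light = np.abs(cur.astype(np.float64) - prev.astype(np.float64)).sum()
        marks.append(1 if delta_light > t1 else 0)
    return marks
```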
进一步地,在确定出初始帧序列中的活动帧和静止帧之后,可基于初始帧序列中的活动帧和静止帧,对初始帧序列进行抽帧处理,从而得到第一帧序列。具体可确定初始帧序列中的连续活动帧序列和连续静止帧序列,对初始帧序列中的连续活动帧序列和连续静止帧序列进行抽帧处理。
作为一示例,可将连续活动帧中除第一帧图像帧和最后一帧图像帧外的其他任意一帧图像帧和/或连续任意数量的图像帧作为目标帧序列,进而将目标帧序列从连续活动帧中进行抽取。同样的,可将连续静止帧中除第一帧图像帧和最后一帧图像帧外的其他任意一帧图像帧和/或连续任意数量的图像帧作为目标帧序列,进而将目标帧序列从连续静止帧中进行抽取。在基于上述方式从连续活动帧序列和连续静止帧序列中抽取出目标帧序列之后,可将抽取目标帧序列后的初始帧序列作为第一帧序列。
在一些可行的实施方式中,为集中对相同场景下的视频进行抽帧处理,可确定待处理视频对应的初始帧率,基于待处理视频的初始帧率将初始帧序列划分为至少一个子帧序列,进而确定各子帧序列中的连续活动帧序列和连续静止帧序列,并从各连续活动帧序列和连续静止帧序列中抽取目标帧序列。
其中,从各子帧序列中的连续活动帧序列和静止帧序列中抽取的目标帧序列,同样为相对应的连续活动帧序列或者静止帧序列中除第一帧和最后一帧外的任意一帧图像帧或者任意连续几帧的图像帧。
如待处理视频的时长为10s,其初始帧率为24Hz。则可将初始帧序列划分为10个时长为1s的子帧序列,每个子帧序列包括24帧图像帧,进而对每一子帧序列进行抽帧处理。
在一些可行的实施方式中,对于各子帧序列中的每一连续活动帧,若该连续活动帧序列中的活动帧的数量大于第二阈值,则对该连续活动帧序列进行抽帧处理。即将该连续活动帧中除第一帧图像帧和最后一帧图像帧外的其他任意一帧图像帧和/或连续任意数量的图像帧作为目标帧序列,进而将目标帧序列从该连续活动帧中进行抽取。对于各子帧序列中的每一连续静止帧,若该连续静止帧序列中的静止帧的数量大于第三阈值,则对该连续静止帧序 列进行抽帧处理。即将该连续静止帧中除第一帧图像帧和最后一帧图像帧外的其他任意一帧图像帧和/或连续任意数量的图像帧作为目标帧序列,进而将目标帧序列从该连续静止帧中进行抽取。对于任一子帧序列,若该子帧序列的连续活动帧中活动帧的数量均小于或者等于第二阈值,且连续静止帧中静止帧的数量均小于或者等于第三阈值,则不对该子帧序列进行抽帧处理(no extraction)。则具体抽帧方式可如下所示:
$$P=\begin{cases}I_{\{2,3,4,\dots,last-1\}}(K(t)=1), & N(K(t)=1)>T_2\\[2pt]I_{\{2,3,4,\dots,last-1\}}(K(t)=0), & N(K(t)=0)>T_3\\[2pt]\text{no extraction}, & \text{其他}\end{cases}$$
其中，P表示抽取出的目标帧序列，N(K(t)=1)表示连续活动帧序列中活动帧的数量，N(K(t)=0)表示连续静止帧序列中静止帧的数量，$T_2$、$T_3$ 分别表示第二阈值和第三阈值，$I_{\{2,3,4,\dots,last-1\}}(K(t)=1)$ 表示连续活动帧序列中的任意中间帧，$I_{\{2,3,4,\dots,last-1\}}(K(t)=0)$ 表示连续静止帧序列中的任意中间帧。
例如某一子帧序列如表1所示:
表1
| 标记 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 |
| 帧序号 | 1 | 2 | 3-20 | 21 | 22 | 23 | 24 | 25 | 26-35 | 36-65 | 66 | 67 |
| 是否抽帧 | 否 | 否 | 是 | 否 | 否 | 否 | 否 | 否 | 是 | 是 | 否 | 否 |
在表1中,帧序号3-20对应有连续活动帧,帧序号26-35对应有连续静止帧,帧序号36-65对应有连续活动帧。若第二阈值和第三阈值为4,则可抽取帧序号3-20对应的连续活动帧、帧序号26-35对应的连续静止帧以及帧序号36-65对应的连续活动帧中,除第一帧和最后一帧外的任意一帧图像帧或者连续多帧图像帧,从而将抽帧处理后的子帧序列确定为第一帧序列。
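A sketch of the run-based extraction rule applied to one sub-sequence of marks, under the assumption (allowed but not required by the description) that all interior frames of a qualifying run are extracted; the first and last frame of every run are always kept.

```python
def select_target_frames(marks, t2, t3):
    """Return the 0-based indices to extract from one sub-sequence.

    Runs of active frames longer than T2 and runs of static frames longer
    than T3 have their interior frames extracted; run endpoints are kept.
    Extracting the whole interior (rather than any one or several
    consecutive frames) is an illustrative choice.
    """
    extracted, start = [], 0
    while start < len(marks):
        end = start
        while end + 1 < len(marks) and marks[end + 1] == marks[start]:
            end += 1
        threshold = t2 if marks[start] == 1 else t3
        if end - start + 1 > threshold:
            extracted.extend(range(start + 1, end))  # keep both run endpoints
        start = end + 1
    return extracted
```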
步骤S23、对第一帧序列进行编码,得到待处理视频的编码数据,编码数据携带各参考帧的参考帧序号。
在一些可行的实施方式中,在对初始帧序列进行抽帧处理得到第一帧序列之后,可对第一帧序列进行编码,得到待处理视频的编码数据。
具体地,在对第一帧序列进行编码时所采用的编码方式包括但不限于H.264、H.265、AVS2以及AV1等,具体可基于实际应用场景需求确定,在此不做限制。
其中,待处理视频的编码数据携带各参考帧的参考帧序号,各参考帧为与抽帧处理中抽取出的目标帧序列相邻的图像帧。即在从初始帧序列中抽取出目标帧序列得到第一帧序列之后,可将第一帧序列中与抽取出的目标帧序列相邻的两个图像帧确定为参考帧,并确定各参考帧的帧序号。
如表1所示的帧序号3至帧序号20的活动帧序列,若从该活动帧序列中抽取的目标帧序列为帧序号4至帧序号5的图像帧,则与该目标帧序列的相邻的图像帧为帧序号3的图像帧以及帧序号6的图像帧,并将其确定为两个参考帧。
在对第一帧序列进行编码时,可将各参考帧的参考帧序号与第一帧序列一同进行编码,从而使待处理视频的编码数据携带各参考帧序号。或者可对第一帧序列进行编码得到待处理视频的编码数据之后,再将各参考帧序号与编码数据进行进一步处理,以使待处理视频的编码数据携带各参考帧序号。
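The reference-frame numbers that accompany the encoded data can be derived directly from the extracted indices: each consecutive run of extracted frames is bounded by the two kept frames that become its associated reference pair. A small sketch under the assumption of a sorted list of extracted frame numbers:

```python
def reference_frame_pairs(extracted):
    """Group sorted extracted frame numbers into consecutive runs and return,
    for each run, the adjacent kept frames (run start - 1, run end + 1) that
    serve as its pair of reference frames, signalled with the bitstream.
    """
    pairs, i = [], 0
    while i < len(extracted):
        j = i
        while j + 1 < len(extracted) and extracted[j + 1] == extracted[j] + 1:
            j += 1
        pairs.append((extracted[i] - 1, extracted[j] + 1))
        i = j + 1
    return pairs
```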
步骤S24、向第一设备发送编码数据,以使第一设备基于编码数据和各参考帧序号对应的参考帧,确定第二帧序列,并基于第二帧序列确定播放视频。
在一些可行的实施方式中,在得到携带各参考帧的参考帧序号的编码数据后,可向第一设备200发送编码数据,使得第一设备200可基于编码数据和各参考帧序号对应的参考帧来确定第二帧序列,并基于第二帧序列确定播放视频。
具体地,第二设备100向第一设备200发送编码数据的具体方式包括但不限于内容分发网络(Content Delivery Network,CDN)传输技术、对等(Peer-to-peer,P2P)网络传输技术以及CDN和P2P相结合的PCDN传输技术。
其中,第二设备100向第一设备200发送编码数据的同时,将各参考帧的帧序号一并向第一设备200发送。
在本申请实施例中，通过对待处理视频的初始帧序列进行抽帧处理，可将高帧率的初始帧序列转换为低帧率的第一帧序列，从而对第一帧序列进行编码会大幅减小视频数据的大小，进而相应减少编码数据传输所消耗的数据流量，从而达到节省带宽成本的作用。
在一些可行的实施方式中,第二设备100在向第一设备200发送携带各参考帧序号的编码数据之后,第一设备200对编码数据的具体处理方式可参见图3,图3是本申请实施例提供的视频数据处理方法的另一流程示意图。本申请实施例提供的视频数据处理方法在应用于第一设备200时,可具体包括如下步骤:
步骤S31、获取第二设备100发送的编码数据,对编码数据进行解码得到第一帧序列。
在一些可行的实施方式中,第一设备200在获取到第二设备100发送的编码数据之后,可基于第二设备100采用的编码技术对编码数据进行解码,得到第一帧序列。其中,第一帧序列是第二设备100对待处理视频的初始帧序列进行抽帧处理后得到的帧序列,即从初始帧序列中抽取出目标帧序列后的剩余帧序列。
步骤S32、基于编码数据携带的各参考帧序号,确定第一帧序列对应的参考帧中的各组关联参考帧。
在一些可行的实施方式中,第一设备200在对编码数据进行解码后得到的第一帧序列中各图像帧的序号为其对应于待处理视频的初始帧序列中的帧序号。基于此,第一设备200在获取到各参考帧序号之后,可基于各参考帧序号确定第一帧序列中的参考帧。
进一步地,第一设备200可从第一帧序列对应的各参考帧中确定出各组关联参考帧,每组关联参考帧包括与第二设备100在对初始帧序列进行抽帧处理过程中,抽取出的目标帧序列相邻的两个参考帧。进而第一设备200可确定每组关联参考帧所对应的目标预测帧,并将目标预测帧进行插帧处理得到第二帧序列。其中,第一设备200确定每组关联参考帧所对应的目标预测帧,并将目标预测帧进行插帧处理得到第二帧序列的具体实现方式详见下述,在此不做说明。
步骤S33、对于每组关联参考帧,基于该组关联参考帧中的第一参考帧和第二参考帧,确定第一参考帧对应的第一预测帧和第二参考帧对应的第二 预测帧、以及帧预测过程中的遮挡权重和重建残差,基于第一预测帧、第二预测帧、遮挡权重和重建残差,确定该组关联参考帧对应的目标预测帧。
在一些可行的实施方式中,对于每组关联参考帧,可基于该组关联参考帧中的第一参考帧和第二参考帧,确定第一参考帧对应的第一预测帧和第二参考帧对应的第二预测帧。
其中,对于每组关联参考帧,该组关联参考帧中的第一参考帧为参考帧序号较小的参考帧,第二参考帧为参考帧序号较大的参考帧。第一预测帧和第二预测帧均为第一参考帧和第二参考帧之间的图像帧。
具体地,对于每组关联参考帧,可基于该组关联参考帧中的第一参考帧和第二参考帧,确定第一参考帧对应的第一光流场和第二参考帧对应的第二光流场。
其中,光流场是指图像中所有像素点构成的一种二维瞬时速度场,其包含了像素在时域上的变化以及相邻帧之间的相关性,通过该相关性来找到上一帧跟当前帧之间存在的对应关系。
对于每组关联参考帧,在确定第一参考帧和第二参考帧对应的光流场时,可对第一参考帧进行特征提取得到第一初始特征,对第二参考帧进行特征提取,得到第二初始特征,进而基于第一初始特征和第二初始特征,得到第一参考帧和第二参考帧对应的关联特征。
其中,在对第一参考帧和第二参考帧进行特征提取时,可基于神经网络等对每一参考帧进行特征提取,得到对应的初始特征。在得到第一初始特征和第二初始特征之后,可基于特征拼接、特征融合或者基于其他神经网络模型的进一步处理,得到第一初始特征和第二初始特征的关联特征,具体实现方式可基于实际应用场景需求确定,在此不做限制。
对于第一参考帧,可确定第一参考帧的第一上下文特征,基于第一上下文特征和上述关联特征,可确定第一参考帧对应的第一光流场。对于第二参考帧,可确定第二参考帧的第二上下文特征,基于第二上下文特征和上述关联特征,可确定第二参考帧对应的第二光流场。
其中,每一参考帧的上下文特征可基于上下文特征提取网络实现。参见图4,图4是本申请实施例提供的确定上下文特征的一场景示意图。在图4所 示的上下文特征提取网络中包括多个串联的卷积层和激活函数组合。对于第一参考帧和第二参考帧中的任一参考帧,可基于第一个卷积层对该参考帧进行卷积处理,得到第一卷积特征,并通过第一个激活函数对第一卷积特征进行处理得到该参考帧的一个特征图。进一步地,继续基于第二个卷积层对上一个激活函数得到的特征图进行卷积处理,得到第二卷积特征,并通过第二个激活函数对第二卷积特征进行处理得到该参考帧的第二个特征图。以此类推,可将图4中各激活函数得到的特征图确定为该参考帧的上下文特征。
其中,图4所示的特征提取网络中的卷积层和激活函数的数量具体可基于实际应用场景需求确定,在此不做限制。
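A PyTorch sketch of a context-feature extractor of the kind described for Fig. 4: a chain of convolution + activation stages whose intermediate outputs are all collected as the context feature. The channel widths, the stride of 2, and the PReLU activation are assumptions; only the conv-plus-activation chain and the collection of every stage's output come from the description.

```python
import torch
import torch.nn as nn

class ContextNet(nn.Module):
    """Chain of conv + activation stages; every stage's output is kept as one
    feature map of the context feature."""
    def __init__(self, in_ch=3, widths=(16, 32, 64, 128, 256)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_ch
        for w in widths:
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),
                nn.PReLU(w)))
            prev = w

    def forward(self, frame):
        feats, x = [], frame
        for stage in self.stages:
            x = stage(x)
            feats.append(x)  # each activation output is one feature map
        return feats
```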
其中,确定各组关联参考帧的第一参考帧和第二参考帧对应的光流场时,可基于光流场估计模型(Recurrent All-Pairs Field Transforms for Optical Flow,RAFT)确定。
作为一示例，参见图5，图5是本申请实施例提供的光流场估计模型的一结构示意图。如图5所示，对于任一组关联参考帧，将该组关联参考帧中第一参考帧 $I_a$ 和第二参考帧 $I_d$ 输入特征编码模块，以基于特征编码模块分别对第一参考帧 $I_a$ 和第二参考帧 $I_d$ 进行特征提取，得到第一初始特征和第二初始特征，进一步基于第一初始特征和第二初始特征进行特征关联，得到关联特征。其中，I表示参考帧，a和d分别为参考帧的位置信息（如帧序号、时域位置等）。
对于第一参考帧 $I_a$，可基于上下文特征提取网络确定第一参考帧 $I_a$ 的第一上下文特征 $C_a$。其中，第一上下文特征 $C_a$ 可表示为 $C_a=\{C_a^{(1)},C_a^{(2)},\dots,C_a^{(n)}\}$，$C_a^{(1)}$、$C_a^{(2)}$、……以及 $C_a^{(n)}$ 分别为基于一个卷积层和激活函数得到的特征图。将第一参考帧 $I_a$ 的第一上下文特征 $C_a$ 和关联特征输入循环神经网络，得到第一参考帧 $I_a$ 的第一光流场 $F_{b\to a}$，其中，b为该组关联参考帧对应的目标预测帧的位置信息，其中a大于b大于d。
同理，对于第二参考帧 $I_d$，可基于上下文特征提取网络确定第二参考帧 $I_d$ 的第二上下文特征 $C_d$。其中，第二上下文特征 $C_d$ 可表示为 $C_d=\{C_d^{(1)},C_d^{(2)},\dots,C_d^{(n)}\}$，$C_d^{(1)}$、$C_d^{(2)}$、……以及 $C_d^{(n)}$ 分别为基于一个卷积层和激活函数得到的特征图。将第二参考帧 $I_d$ 的第二上下文特征 $C_d$ 和关联特征输入循环神经网络，得到第二参考帧 $I_d$ 的第二光流场 $F_{b\to d}$，其中，b为该组关联参考帧对应的目标预测帧的位置信息，其中a大于b大于d。
进一步地，基于第一光流场对第一参考帧进行后向映射，得到第一参考帧对应的第一预测帧，基于第二光流场对第二参考帧进行后向映射，得到第二参考帧对应的第二预测帧。如基于第一光流场 $F_{b\to a}$ 对第一参考帧 $I_a$ 进行后向映射，得到第一预测帧 $\hat{I}_{b\to a}$；基于第二光流场 $F_{b\to d}$ 对第二参考帧 $I_d$ 进行后向映射，得到第二预测帧 $\hat{I}_{b\to d}$。
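Backward mapping (backward warping) with an optical flow field can be sketched with `grid_sample`: every target pixel samples the reference frame at its own location displaced by the flow. Float tensors and flow in pixel units, with x in channel 0 and y in channel 1, are assumptions.

```python
import torch
import torch.nn.functional as F

def backward_warp(image, flow):
    """Backward-map `image` (N,C,H,W) with flow (N,2,H,W): the output at pixel
    p is bilinearly sampled from `image` at p + flow(p). Float tensors assumed."""
    n, _, h, w = image.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=image.device, dtype=image.dtype),
        torch.arange(w, device=image.device, dtype=image.dtype),
        indexing="ij")
    grid_x = xs.unsqueeze(0) + flow[:, 0]
    grid_y = ys.unsqueeze(0) + flow[:, 1]
    # normalise sampling coordinates to [-1, 1] for grid_sample
    grid = torch.stack((2.0 * grid_x / max(w - 1, 1) - 1.0,
                        2.0 * grid_y / max(h - 1, 1) - 1.0), dim=-1)
    return F.grid_sample(image, grid, mode="bilinear", align_corners=True)
```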
在一些可行的实施方式中,对于每组关联参考帧,基于该组关联参考帧中的第一参考帧和第二参考帧,确定帧预测过程中的遮挡权重和重建残差。其中,重建残差用于减少帧预测过程中的梯度下降问题,遮挡权重用于减少帧预测过程中运动物体抖动以及边缘模糊等带来的影响。
具体地,可先确定第一参考帧的第三上下文特征和第二参考帧的第四上下文特征。其中,第一参考帧的第三上下文特征和第二参考帧的第四上下文特征可基于图4所示的方式确定,也可基于其他上下文特征提取网络确定,具体可基于实际应用场景需求确定,在此不做限制。
进一步地,可基于第一光流场、第二光流场、第一预测帧、第二预测帧、第三上下文特征以及第四上下文特征,确定帧预测过程中的遮挡权重和重建残差。如将第一光流场、第二光流场、第一预测帧、第二预测帧、第三上下文特征以及第四上下文特征输入深度神经网络中,得到帧预测过程中的遮挡权重和重建残差。其中,上述深度神经网络包括但不限于FusionNet和U-Net,具体可基于实际应用场景需求确定,在此不做限制。
在一些可行的实施方式中,在基于第一光流场、第二光流场、第一预测帧、第二预测帧、第三上下文特征以及第四上下文特征,确定帧预测过程中的遮挡权重和重建残差时,可先基于第一光流场、第二光流场、第一预测帧和第二预测帧,确定残差特征。
作为一示例,参见图6,图6是本申请实施例提供的确定残差特征的一场景示意图。如图6所示,将第一光流场、第二光流场、第一预测帧和第二预测帧输入卷积层,通过卷积神经网络和激活函数进行处理后,将处理结果输入至残差块中得到残差特征。
进一步地，可基于第三上下文特征、第四上下文特征以及残差特征，确定融合特征。参见图7，图7是本申请实施例提供的确定融合特征的一场景示意图。如图7所示，将残差特征与第三上下文特征和第四上下文特征中的第一个上下文特征（即各自的第一个特征图）进行拼接得到第一拼接特征，将第一拼接特征输入卷积层中进行下采样卷积处理，得到第一卷积特征。将第一卷积特征与第三上下文特征和第四上下文特征中的第二个上下文特征进行拼接得到第二拼接特征，将第二拼接特征输入卷积层中进行下采样卷积处理，得到第二卷积特征。将第二卷积特征与第三上下文特征和第四上下文特征中的第三个上下文特征进行拼接得到第三拼接特征，将第三拼接特征输入卷积层中进行下采样卷积处理，得到第三卷积特征。将第三卷积特征与第三上下文特征和第四上下文特征中的第四个上下文特征进行拼接得到第四拼接特征，将第四拼接特征输入卷积层中进行下采样卷积处理，得到第四卷积特征。将第四卷积特征与第三上下文特征和第四上下文特征中的第五个上下文特征进行拼接得到第五拼接特征，将第五拼接特征输入卷积层中进行上采样卷积处理，得到第五卷积特征。
进一步地,将第五卷积特征和第三卷积特征进行拼接得到第六拼接特征,将第六拼接特征输入至卷积层进行上采样处理,得到第六卷积特征。将第六卷积特征和第二卷积特征进行拼接得到第七拼接特征,将第七拼接特征输入至卷积层进行上采样处理,得到第七卷积特征。将第七卷积特征和第一卷积特征进行拼接得到第八拼接特征,将第八拼接特征输入至卷积层进行上采样处理,得到第九卷积特征。将第九卷积特征和残差特征进行拼接可得到融合特征。
其中,在图7所示的确定融合特征的方法中,用于进行上采样处理的卷积层和用于进行下采样处理的卷积层的数量相同,具体数量与第三上下文特征或者第四上下文特征中的特征图的数量一致。
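A PyTorch sketch of the encoder–decoder pattern of Fig. 7 as described above: the residual feature is concatenated with the two context pyramids level by level on the way down, decoded back up with skip connections, and finally concatenated with the residual feature again. The channel widths, the fixed internal `width`, PReLU, and transposed-convolution up-sampling are assumptions; spatial sizes are assumed divisible by 32 so the skip connections line up.

```python
import torch
import torch.nn as nn

def down(in_ch, out_ch):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1),
                         nn.PReLU(out_ch))

def up(in_ch, out_ch):
    return nn.Sequential(nn.ConvTranspose2d(in_ch, out_ch, 4, stride=2, padding=1),
                         nn.PReLU(out_ch))

class FusionUNet(nn.Module):
    """Concatenate residual feature with both context pyramids on the way
    down, decode with skip connections, then concatenate with the residual
    feature again to form the fusion feature."""
    def __init__(self, res_ch=64, ctx=(16, 32, 64, 128, 256), width=128):
        super().__init__()
        c1, c2, c3, c4, c5 = ctx
        self.d1 = down(res_ch + 2 * c1, width)
        self.d2 = down(width + 2 * c2, width)
        self.d3 = down(width + 2 * c3, width)
        self.d4 = down(width + 2 * c4, width)
        self.u5 = up(width + 2 * c5, width)
        self.u6 = up(2 * width, width)
        self.u7 = up(2 * width, width)
        self.u8 = up(2 * width, width)

    def forward(self, res, ctx_a, ctx_b):
        # ctx_a / ctx_b: lists of five context feature maps for the two frames
        x1 = self.d1(torch.cat([res, ctx_a[0], ctx_b[0]], dim=1))
        x2 = self.d2(torch.cat([x1, ctx_a[1], ctx_b[1]], dim=1))
        x3 = self.d3(torch.cat([x2, ctx_a[2], ctx_b[2]], dim=1))
        x4 = self.d4(torch.cat([x3, ctx_a[3], ctx_b[3]], dim=1))
        x5 = self.u5(torch.cat([x4, ctx_a[4], ctx_b[4]], dim=1))
        x6 = self.u6(torch.cat([x5, x3], dim=1))
        x7 = self.u7(torch.cat([x6, x2], dim=1))
        x8 = self.u8(torch.cat([x7, x1], dim=1))
        return torch.cat([x8, res], dim=1)  # fusion feature
```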
在一些可行的实施方式中,在基于第三上下文特征、第四上下文特征以及残差特征确定融合特征时,可基于第一光流场对第三上下文特征中各特征图进行后向映射,得到第五上下文特征,基于第二光流场对第四上下文特征中各特征图进行后向映射,得到第六上下文特征。进而基于第五上下文特征、 第六上下文特征以及残差特征,确定融合特征,具体确定方式同图7所示的实现方式,在此不再说明。
作为一示例,参见图8,图8是本申请实施例提供的确定上下文特征的另一场景示意图。如图8所示,在基于各卷积层和激活参数的组合确定第一参考帧的各特征图,得到第三上下文特征后,可基于第一光流场对第三上下文特征中的各特征图分别进行后向映射,得到各特征图对应的映射特征图,进而将各映射特征图确定为第五上下文特征。
可选地,由于第一参考帧的第三上下文特征、以及第二参考帧的第四上下文特征中各特征图的大小不同,因此对于第三上下文特征中的每一特征图,可确定该特征图对应的光流场权重,以基于该光流场权重和第一光流场确定对该特征图进行后向映射时所对应的新的光流场。进而对于第三上下文特征中的每一特征图,可基于该特征图对应的新的光流场对该特征图进行后向映射,得到该特征图对应的映射特征图。进而基于第三上下文特征中各特征图对应的映射特征图,确定第五上下文特征。
作为一示例,参见图9,图9是本申请实施例提供的确定上下文特征的又一场景示意图。如图9所示,在得到第一参考帧的第三上下文特征之后,可确定第三上下文特征中各特征图对应的光流场权重,如1、0.5、0.25、0.125、0.0625等,进而基于第一光流场和各光流场权重确定各特征图对应的新的光流场,如光流场1、光流场2、光流场3、光流场4以及光流场5。进而基于各特征图各自对应的新的光流场对各特征图进行后向映射,得到各特征图对应的映射特征图,并将各映射特征图确定为第五上下文特征。
同理,对于第四上下文特征中的每一特征图,可确定该特征图对应的光流场权重,以基于该光流场权重和第二光流场确定对该特征图进行后向映射时所对应的新的光流场。进而对于第四上下文特征中的每一特征图,可基于该特征图对应的新的光流场对该特征图进行后向映射,得到该特征图对应的映射特征图。进而基于第四上下文特征中各特征图对应的映射特征图,确定第六上下文特征。
其中,第三上下文特征和第四上下文特征中各特征图对应的光流场权重具体可基于实际应用场景确定,在此不做限制。
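A sketch of the per-level backward mapping that produces the fifth/sixth context features: the full-resolution flow is scaled by each feature map's weight (following the 1, 0.5, 0.25, 0.125, 0.0625 example above), resized to that map's resolution, and used to warp it. It reuses the `backward_warp` helper sketched earlier; the bilinear resize of the flow is an assumption.

```python
import torch.nn.functional as F

def map_context_pyramid(ctx_maps, flow, weights=(1.0, 0.5, 0.25, 0.125, 0.0625)):
    """Backward-map each context feature map with its own scaled flow field.
    `backward_warp` is the sampling helper from the earlier sketch."""
    mapped = []
    for feat, scale in zip(ctx_maps, weights):
        h, w = feat.shape[-2:]
        level_flow = F.interpolate(flow, size=(h, w), mode="bilinear",
                                   align_corners=False) * scale
        mapped.append(backward_warp(feat, level_flow))
    return mapped
```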
在一些可行的实施方式中,在基于重建残差确定出融合特征之后,可对融合特征进行进一步处理,得到目标特征。具体地,如图10所示,图10是本申请实施例提供的确定重建残差和遮挡权重的场景示意图。可将融合特征输入至卷积层以对融合特征进行进一步处理,并将处理结果进行子像素卷积,得到高分辨率的目标特征。进而基于目标特征确定帧预测过程中的遮挡权重和重建残差。
具体地,在基于目标特征确定帧预测过程中的遮挡权重和重建残差时,可确定目标特征对应的通道数以及各对应于各通道的特征值。进而将最后一个通道的特征值确定为帧预测过程中的遮挡权重,基于其他通道特征值确定帧预测过程中的重建残差。如将除最后一个通道外其他通道对应的特征值进行拼接,得到帧预测过程中的重建残差。
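A sketch of the head described above: a convolution followed by sub-pixel convolution (PixelShuffle) turns the fusion feature into a higher-resolution target feature, whose last channel is taken as the occlusion weight and whose remaining channels form the reconstruction residual. The channel counts (192 input channels match the earlier FusionUNet assumption), the ×2 upscale factor and the sigmoid on the occlusion weight are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    def __init__(self, in_ch=192, out_ch=4, upscale=2):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * upscale * upscale, 3, padding=1)
        self.shuffle = nn.PixelShuffle(upscale)   # sub-pixel convolution

    def forward(self, fusion_feature):
        target = self.shuffle(self.conv(fusion_feature))  # (N, out_ch, rH, rW)
        occlusion = torch.sigmoid(target[:, -1:])         # last channel
        residual = target[:, :-1]                         # remaining channels
        return occlusion, residual
```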
下面结合图11对基于第一光流场、第二光流场、第一预测帧、第二预测帧、第三上下文特征以及第四上下文特征,确定帧预测过程中的遮挡权重和重建残差进行进一步说明。图11是本申请实施例提供的确定遮挡权重和重建残差的另一场景示意图。即通过图6所示的方式,基于第一光流场、第二光流场、第一预测帧和第二预测帧确定残差特征,通过图7所示的方式,基于残差特征和第三上下文特征以及第四上下文特征中各特征图,确定融合特征。进而通过图10所示的方式,基于残差特征确定帧预测过程中的重建残差和遮挡权重。
在一些可行的实施方式中,在确定出帧预测过程中的遮挡权重和重建残差之后,可基于第一预测帧、第二预测帧、遮挡权重和重建残差,确定该组关联参考帧对应的目标预测帧。具体确定方式可如下所示:
$$\hat{I}_b = M\odot\hat{I}_{b\to a} + (1-M)\odot\hat{I}_{b\to d} + \Delta$$
其中，$\hat{I}_{b\to a}$ 表示第一预测帧，$\hat{I}_{b\to d}$ 表示第二预测帧，M表示遮挡权重，Δ表示重建残差，$\hat{I}_b$ 表示目标预测帧，⊙表示点乘运算。
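The blend itself is a one-liner; the form below (M weighting the first prediction and 1−M the second, with the residual added on top) is the standard interpolation fusion assumed by the reconstruction above.

```python
def blend_target_frame(pred_a, pred_d, occlusion, residual):
    """Target predicted frame: M ⊙ pred_a + (1 - M) ⊙ pred_d + Δ."""
    return occlusion * pred_a + (1.0 - occlusion) * pred_d + residual
```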
步骤S34、将各组关联参考帧对应的目标预测帧进行插帧处理,得到第二帧序列,基于第二帧序列得到播放视频。
在一些可行的实施方式中，对于每组关联参考帧，该关联参考帧对应的目标预测帧，为该组关联参考帧的第一参考帧和第二参考帧之间的目标预测帧。基于此，可将各组关联参考帧对应的目标预测帧进行插帧处理，将每一关联参考帧对应的目标预测帧插入该关联参考帧的第一参考帧和第二参考帧之间，从而在第一帧序列的基础之上得到第二帧序列。
进一步地,第一设备在得到第二帧序列之后,可基于第二帧序列确定播放视频,即第二帧序列即为第一设备所播放的视频所对应的帧序列。
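A sketch of the interpolation step: each reference pair's target predicted frame(s) are inserted between the two decoded reference frames to form the second (play-out) frame sequence. The dict-based interface is an illustrative assumption.

```python
def build_second_sequence(first_sequence, predicted):
    """first_sequence: {frame_number: decoded frame} for the first frame
    sequence; predicted: {(ref_a, ref_d): [target predicted frames]} per
    associated reference pair. Returns the second frame sequence in order."""
    ordered = sorted(first_sequence.items())
    out = []
    for idx, (frame_no, frame) in enumerate(ordered):
        out.append(frame)
        if idx + 1 < len(ordered):
            next_no = ordered[idx + 1][0]
            out.extend(predicted.get((frame_no, next_no), []))
    return out
```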
下面结合图12对本申请实施例提供的确定目标预测帧的一场景进行说明。图12是本申请实施例提供的确定目标预测帧的一场景示意图。如图12所示，通过RAFT模型确定第一参考帧对应的第一光流场和第二参考帧对应的第二光流场，并基于第一光流场对第一参考帧进行后向映射得到第一预测帧，基于第二光流场对第二参考帧进行后向映射得到第二预测帧。
通过上下文特征提取网络(ContextNet)分别确定第一参考帧对应的第三上下文特征和第二参考帧对应的第四上下文特征,并基于第一光流场对第三上下文特征中各特征图进行后向映射,得到第五上下文特征,基于第二光流场对第四上下文特征中各特征图进行后向映射,得到第六上下文特征。
将第五上下文特征、第六上下文特征、第一光流场、第二光流场、第一预测帧以及第二预测帧输入U-NET网络,得到帧预测过程中的重建残差和遮挡权重,进而基于重建残差、遮挡权重、第一预测帧和第二预测帧确定出目标预测帧。
在本申请实施例中,对于解码得到的第一帧序列中每组关联参考帧,通过确定每组关联参考帧中的第一参考帧和第二参考帧所对应的第一预测帧和第二预测帧、第一光流场和第二光流场、以及帧预测过程中的遮挡权重和重建残差,可充分考虑帧预测过程中的遮挡信息、各图像帧的细节信息以及光流场信息,可有效解决帧预测过程中的物体抖动、边缘模糊等问题,从而提升视频清晰度,提升视频观看体验。
参见图13,图13是本申请实施例提供的视频数据处理装置的一结构示意图。本申请实施例提供的视频数据处理装置包括:
亮度确定模块41,用于确定待处理视频的初始帧序列,确定上述初始帧序列中各图像帧的各像素点的亮度;
帧序列确定模块42,用于基于上述初始帧序列中各图像帧的各像素点的亮度,对上述初始帧序列进行抽帧处理,将抽帧后的帧序列作为第一帧序列;
编码模块43,用于对上述第一帧序列进行编码,得到上述待处理视频的编码数据,上述编码数据携带各参考帧的参考帧序号,上述各参考帧为与上述抽帧处理中抽取出的目标帧序列相邻的图像帧;
发送模块44,用于向第一设备发送上述编码数据,以使上述第一设备基于上述编码数据和上述各参考帧序号对应的参考帧,确定第二帧序列,并基于上述第二帧序列确定播放视频。
在一些可行的实施方式中,上述帧序列确定模块42,用于:
对于上述初始帧序列中的任一图像帧,确定该图像帧的各像素点与前一图像帧的对应像素点的亮度差,基于该图像帧在各像素点对应的亮度差,确定该图像帧与上述前一图像帧的总亮度差;
将上述初始帧序列中,对应的总亮度差大于第一阈值的图像帧确定为活动帧,对应的总亮度差小于或者等于上述第一阈值的图像帧确定为静止帧;
基于上述活动帧和上述静止帧,对上述初始帧序列进行抽帧处理。
在一些可行的实施方式中,上述帧序列确定模块42,用于:
确定上述初始帧序列中的连续活动帧序列和连续静止帧序列;
对上述初始帧序列中的连续活动帧序列和连续静止帧序列进行抽帧处理。
在一些可行的实施方式中,上述帧序列确定模块42,用于:
确定上述待处理视频对应的初始帧率,基于上述初始帧率将上述初始帧序列划分为至少一个子帧序列;
确定各上述子帧序列中的连续活动帧序列和连续静止帧序列;
对于每一上述连续活动帧序列,若该连续活动帧序列中的活动帧的数量大于第二阈值,则对该连续活动帧序列进行抽帧处理;
对于每一上述连续静止帧序列,若该连续静止帧序列中的静止帧的数量大于第三阈值,则对该连续静止帧序列进行抽帧处理。
具体实现中,上述视频数据处理装置可通过其内置的各个功能模块执行如上述图1中各个步骤所提供的实现方式,具体可参见上述各个步骤所提供的实现方式,在此不再赘述。
参见图14,图14是本申请实施例提供的视频数据处理装置的另一结构示意图。本申请实施例提供的视频数据处理装置包括:
解码模块51,用于获取第二设备发送的编码数据,对上述编码数据进行解码得到第一帧序列,上述第一帧序列是上述第二设备对待处理视频的初始帧序列进行抽帧处理后的帧序列;
参考帧确定模块52,用于基于上述编码数据携带的各参考帧序号,确定上述第一帧序列对应的参考帧中的各组关联参考帧,每组上述关联参考帧包括与上述抽帧处理中抽取出的目标帧序列相邻的两个参考帧;
帧预测模块53,用于对于每组上述关联参考帧,基于该组关联参考帧中的第一参考帧和第二参考帧,确定上述第一参考帧对应的第一预测帧和上述第二参考帧对应的第二预测帧、以及帧预测过程中的遮挡权重和重建残差,基于上述第一预测帧、上述第二预测帧、上述遮挡权重和上述重建残差,确定该组关联参考帧对应的目标预测帧;
视频确定模块54,用于将各组上述关联参考帧对应的目标预测帧进行插帧处理,得到第二帧序列,基于上述第二帧序列得到播放视频。
在一些可行的实施方式中,对于每组上述关联参考帧,上述帧预测模块53,用于:
基于该组关联参考帧中的第一参考帧和第二参考帧,确定上述第一参考帧对应的第一光流场以及上述第二参考帧对应的第二光流场;
基于上述第一光流场对上述第一参考帧进行后向映射,得到上述第一参考帧对应的第一预测帧,基于上述第二光流场对上述第二参考帧进行后向映射,得到上述第二参考帧对应的第二预测帧。
在一些可行的实施方式中,对于每组上述关联参考帧,上述帧预测模块53,用于:
对该组关联参考帧中的第一参考帧进行特征提取,得到第一初始特征,对该组关联参考帧中的第二参考帧进行特征提取,得到第二初始特征,基于上述第一初始特征和上述第二初始特征,确定上述第一参考帧和上述第二参考帧对应的关联特征;
确定上述第一参考帧的第一上下文特征,基于上述第一上下文特征和上述关联特征,确定上述第一参考帧对应的第一光流场;
确定上述第二参考帧的第二上下文特征,基于上述第二上下文特征和上述关联特征,确定上述第二参考帧对应的第二光流场。
在一些可行的实施方式中,对于每组上述关联参考帧,上述帧预测模块53,用于:
确定上述第一参考帧的第三上下文特征,确定上述第二参考帧的第四上下文特征;
基于上述第一光流场、上述第二光流场、上述第一预测帧、上述第二预测帧、上述第三上下文特征以及上述第四上下文特征,确定帧预测过程中的遮挡权重和重建残差。
在一些可行的实施方式中,上述帧预测模块53,用于:
基于上述第一光流场、上述第二光流场、上述第一预测帧和上述第二预测帧,确定残差特征;
基于上述第三上下文特征、上述第四上下文特征以及上述残差特征,确定融合特征;
基于上述融合特征确定帧预测过程中的遮挡权重和重建残差。
在一些可行的实施方式中,上述第三上下文特征和上述第四上下文特征包括多个特征图;上述帧预测模块53,用于:
对于上述第三上下文特征中的每一上述特征图,确定该特征图对应的光流场权重,基于该特征图对应的上述光流场权重和上述第一光流场,对该特征图进行后向映射,得到该特征图对应的映射特征图;
将上述第三上下文特征中的各上述特征图对应的映射特征图,确定为上述第一参考帧的第五上下文特征;
对于上述第四上下文特征中的每一上述特征图,确定该特征图对应的光流场权重,基于该特征图对应的上述光流场权重和上述第二光流场,对该特征图进行后向映射,得到该特征图对应的映射特征图;
将上述第四上下文特征中的各上述特征图对应的映射特征图,确定为上述第二参考帧的第六上下文特征;
基于上述第五上下文特征、上述第六上下文特征以及上述残差特征,确定融合特征。
在一些可行的实施方式中,上述帧预测模块53,用于:
对上述融合特征进行特征处理,得到目标特征,并确定上述目标特征对应的通道数;
将上述目标特征对应于最后一个通道的特征值确定为遮挡权重;
基于上述目标特征对应于其他通道的特征值,确定重建残差。
具体实现中,上述视频数据处理装置可通过其内置的各个功能模块执行如上述图3中各个步骤所提供的实现方式,具体可参见上述各个步骤所提供的实现方式,在此不再赘述。
参见图15，图15是本申请实施例提供的电子设备的结构示意图。如图15所示，本实施例中的电子设备1000可以包括：处理器1001，网络接口1004和存储器1005，此外，上述电子设备1000还可以包括：用户接口1003，和至少一个通信总线1002。其中，通信总线1002用于实现这些组件之间的连接通信。其中，用户接口1003可以包括显示屏(Display)、键盘(Keyboard)，可选用户接口1003还可以包括标准的有线接口、无线接口。网络接口1004可选的可以包括标准的有线接口、无线接口(如WI-FI接口)。存储器1005可以是高速RAM存储器，也可以是非不稳定的存储器(non-volatile memory)，例如至少一个磁盘存储器。存储器1005可选的还可以是至少一个位于远离前述处理器1001的存储装置。如图15所示，作为一种计算机可读存储介质的存储器1005中可以包括操作系统、网络通信模块、用户接口模块以及设备控制应用程序。
在图15所示的电子设备1000中,网络接口1004可提供网络通讯功能;而用户接口1003主要用于为用户提供输入的接口;而处理器1001可以用于调用存储器1005中存储的设备控制应用程序,以实现第一设备和/或第二设备所执行的视频数据处理方法。
应当理解,在一些可行的实施方式中,上述处理器1001可以是中央处理单元(central processing unit,CPU),该处理器还可以是其他通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现成可编程门阵列(field-programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。该存储器可以包括只读存储器和随机存取存储器,并向处理器提供指令和数据。存储器的一部分还可以包括非易失性随机存取存储器。例如,存储器还可以存储设备类型的信息。
具体实现中，上述电子设备1000可通过其内置的各个功能模块执行如上述图2和/或图3中各个步骤所提供的实现方式，具体可参见上述各个步骤所提供的实现方式，在此不再赘述。
本申请实施例还提供一种计算机可读存储介质,该计算机可读存储介质存储有计算机程序,被处理器执行以实现图2和/或图3中各个步骤所提供的方法,具体可参见上述各个步骤所提供的实现方式,在此不再赘述。
上述计算机可读存储介质可以是前述任一实施例提供的视频数据处理装置和/或电子设备的内部存储单元,例如电子设备的硬盘或内存。该计算机可读存储介质也可以是该电子设备的外部存储设备,例如该电子设备上配备的插接式硬盘,智能存储卡(smart media card,SMC),安全数字(secure digital,SD)卡,闪存卡(flash card)等。上述计算机可读存储介质还可以包括磁碟、光盘、只读存储记忆体(read-only memory,ROM)或随机存储记忆体(random access memory,RAM)等。进一步地,该计算机可读存储介质还可以既包括该电子设备的内部存储单元也包括外部存储设备。该计算机可读存储介质用于存储该计算机程序以及该电子设备所需的其他程序和数据。该计算机可读存储介质还可以用于暂时地存储已经输出或者将要输出的数据。
本申请实施例提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机指令,该计算机指令存储在计算机可读存储介质中。电子设备的处理器从计算机可读存储介质读取该计算机指令,处理器执行该计算机指令,使得该计算机设备执行图2和/或图3中各个步骤所提供的方法。
本申请的权利要求书和说明书及附图中的术语“第一”、“第二”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或电子设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或电子设备固有的其它步骤或单元。在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置展示该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。在本申请说明书和所附权利要求书中使用的术语“和/或”是指相关联列出的项中的一个 或多个的任何组合以及所有可能组合,并且包括这些组合。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、计算机软件或者二者的结合来实现,为了清楚地说明硬件和软件的可互换性,在上述说明中已经按照功能一般性地描述了各示例的组成及步骤。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
以上所揭露的仅为本申请较佳实施例而已,不能以此来限定本申请之权利范围,因此依本申请权利要求所作的等同变化,仍属本申请所涵盖的范围。

Claims (15)

  1. 一种视频数据处理方法,所述方法包括:
    确定待处理视频的初始帧序列,确定所述初始帧序列中各图像帧的各像素点的亮度;
    基于所述初始帧序列中各图像帧的各像素点的亮度,对所述初始帧序列进行抽帧处理,将抽帧后的帧序列作为第一帧序列;
    对所述第一帧序列进行编码,得到所述待处理视频的编码数据,所述编码数据携带各参考帧的参考帧序号,各所述参考帧为与所述抽帧处理中抽取出的目标帧序列相邻的图像帧;
    向第一设备发送所述编码数据,以使所述第一设备基于所述编码数据和各所述参考帧序号对应的参考帧,确定第二帧序列,并基于所述第二帧序列确定播放视频。
  2. 根据权利要求1所述的方法,其中,所述基于所述初始帧序列中各图像帧的各像素点的亮度,对所述初始帧序列进行抽帧处理,包括:
    对于所述初始帧序列中的任一图像帧,确定该图像帧的各像素点与前一图像帧的对应像素点的亮度差,基于该图像帧在各像素点对应的亮度差,确定该图像帧与所述前一图像帧的总亮度差;
    将所述初始帧序列中,对应的总亮度差大于第一阈值的图像帧确定为活动帧,对应的总亮度差小于或者等于所述第一阈值的图像帧确定为静止帧;
    基于所述活动帧和所述静止帧,对所述初始帧序列进行抽帧处理。
  3. 根据权利要求2所述的方法,其中,所述基于所述活动帧和所述静止帧,对所述初始帧序列进行抽帧处理,包括:
    确定所述初始帧序列中的连续活动帧序列和连续静止帧序列;
    对所述初始帧序列中的连续活动帧序列和连续静止帧序列进行抽帧处理。
  4. 根据权利要求3所述的方法,其中,所述确定所述初始帧序列中的连续活动帧序列和连续静止帧序列,包括:
    确定所述待处理视频对应的初始帧率,基于所述初始帧率将所述初始帧序列划分为至少一个子帧序列;
    确定各所述子帧序列中的连续活动帧序列和连续静止帧序列;
    所述对所述初始帧序列中的连续活动帧序列和连续静止帧序列进行抽帧处理,包括:
    对于每一所述连续活动帧序列,若该连续活动帧序列中的活动帧的数量大于第二阈值,则对该连续活动帧序列进行抽帧处理;
    对于每一所述连续静止帧序列,若该连续静止帧序列中的静止帧的数量大于第三阈值,则对该连续静止帧序列进行抽帧处理。
  5. 一种视频数据处理方法,所述方法包括:
    获取第二设备发送的编码数据,对所述编码数据进行解码得到第一帧序列,所述第一帧序列是所述第二设备对待处理视频的初始帧序列进行抽帧处理后的帧序列;
    基于所述编码数据携带的各参考帧序号,确定所述第一帧序列对应的参考帧中的各组关联参考帧,每组所述关联参考帧包括与所述抽帧处理中抽取出的目标帧序列相邻的两个参考帧;
    对于每组所述关联参考帧,基于该组关联参考帧中的第一参考帧和第二参考帧,确定所述第一参考帧对应的第一预测帧和所述第二参考帧对应的第二预测帧、以及帧预测过程中的遮挡权重和重建残差,基于所述第一预测帧、所述第二预测帧、所述遮挡权重和所述重建残差,确定该组关联参考帧对应的目标预测帧;
    将各组所述关联参考帧对应的目标预测帧进行插帧处理,得到第二帧序列,基于所述第二帧序列得到播放视频。
  6. 根据权利要求5所述的方法,其中,对于每组所述关联参考帧,所述基于该组关联参考帧中的第一参考帧和第二参考帧,确定所述第一参考帧对应的第一预测帧和所述第二参考帧对应的第二预测帧,包括:
    基于该组关联参考帧中的第一参考帧和第二参考帧,确定所述第一参考帧对应的第一光流场以及所述第二参考帧对应的第二光流场;
    基于所述第一光流场对所述第一参考帧进行后向映射,得到所述第一参考帧对应的第一预测帧,基于所述第二光流场对所述第二参考帧进行后向映射,得到所述第二参考帧对应的第二预测帧。
  7. 根据权利要求6所述的方法,其中,对于每组所述关联参考帧,所述基于该组关联参考帧中的第一参考帧和第二参考帧,确定所述第一参考帧对应的第一光流场以及所述第二参考帧对应的第二光流场,包括:
    对该组关联参考帧中的第一参考帧进行特征提取,得到第一初始特征,对该组关联参考帧中的第二参考帧进行特征提取,得到第二初始特征,基于所述第一初始特征和所述第二初始特征,确定所述第一参考帧和所述第二参考帧对应的关联特征;
    确定所述第一参考帧的第一上下文特征,基于所述第一上下文特征和所述关联特征,确定所述第一参考帧对应的第一光流场;
    确定所述第二参考帧的第二上下文特征,基于所述第二上下文特征和所述关联特征,确定所述第二参考帧对应的第二光流场。
  8. 根据权利要求6所述的方法,其中,对于每组所述关联参考帧,基于该组关联参考帧中的第一参考帧和第二参考帧,确定帧预测过程中的遮挡权重和重建残差,包括:
    确定所述第一参考帧的第三上下文特征,确定所述第二参考帧的第四上下文特征;
    基于所述第一光流场、所述第二光流场、所述第一预测帧、所述第二预测帧、所述第三上下文特征以及所述第四上下文特征,确定帧预测过程中的遮挡权重和重建残差。
  9. 根据权利要求8所述的方法,其中,所述基于所述第一光流场、所述第二光流场、所述第一预测帧、所述第二预测帧、所述第三上下文特征以及所述第四上下文特征,确定帧预测过程中的遮挡权重和重建残差,包括:
    基于所述第一光流场、所述第二光流场、所述第一预测帧和所述第二预测帧,确定残差特征;
    基于所述第三上下文特征、所述第四上下文特征以及所述残差特征,确定融合特征;
    基于所述融合特征确定帧预测过程中的遮挡权重和重建残差。
  10. 根据权利要求9所述的方法,其中,所述第三上下文特征和所述第四上下文特征包括多个特征图;所述基于所述第三上下文特征、所述第四上下文特征以及所述残差特征,确定融合特征,包括:
    对于所述第三上下文特征中的每一所述特征图,确定该特征图对应的光流场权重,基于该特征图对应的所述光流场权重和所述第一光流场,对该特征图进行后向映射,得到该特征图对应的映射特征图;
    将所述第三上下文特征中的各所述特征图对应的映射特征图,确定为所述第一参考帧的第五上下文特征;
    对于所述第四上下文特征中的每一所述特征图,确定该特征图对应的光流场权重,基于该特征图对应的所述光流场权重和所述第二光流场,对该特征图进行后向映射,得到该特征图对应的映射特征图;
    将所述第四上下文特征中的各所述特征图对应的映射特征图,确定为所述第二参考帧的第六上下文特征;
    基于所述第五上下文特征、所述第六上下文特征以及所述残差特征,确定融合特征。
  11. 根据权利要求9所述的方法,其中,所述基于所述融合特征确定帧预测过程中的遮挡权重和重建残差,包括:
对所述融合特征进行特征处理，得到目标特征，并确定所述目标特征对应的通道数；
    将所述目标特征对应于最后一个通道的特征值确定为遮挡权重;
    基于所述目标特征对应于其他通道的特征值,确定重建残差。
  12. 一种视频数据处理装置,所述装置包括:
    亮度确定模块,用于确定待处理视频的初始帧序列,确定所述初始帧序列中各图像帧的各像素点的亮度;
    帧序列确定模块,用于基于所述初始帧序列中各图像帧的各像素点的亮度,对所述初始帧序列进行抽帧处理,将抽帧后的帧序列作为第一帧序列;
    编码模块,用于对所述第一帧序列进行编码,得到所述待处理视频的编码数据,所述编码数据携带各参考帧的参考帧序号,各所述参考帧为与所述抽帧处理中抽取出的目标帧序列相邻的图像帧;
    发送模块,用于向第一设备发送所述编码数据,以使所述第一设备基于所述编码数据和各所述参考帧序号对应的参考帧,确定第二帧序列,并基于所述第二帧序列确定播放视频。
  13. 一种视频数据处理装置,所述装置包括:
    解码模块,用于获取第二设备发送的编码数据,对所述编码数据进行解码得到第一帧序列,所述第一帧序列是所述第二设备对待处理视频的初始帧序列进行抽帧处理后的帧序列;
    参考帧确定模块,用于基于所述编码数据携带的各参考帧序号,确定所述第一帧序列对应的参考帧中的各组关联参考帧,每组所述关联参考帧包括与所述抽帧处理中抽取出的目标帧序列相邻的两个参考帧;
    帧预测模块,用于对于每组所述关联参考帧,基于该组关联参考帧中的第一参考帧和第二参考帧,确定所述第一参考帧对应的第一预测帧和所述第二参考帧对应的第二预测帧、以及帧预测过程中的遮挡权重和重建残差,基于所述第一预测帧、所述第二预测帧、所述遮挡权重和所述重建残差,确定该组关联参考帧对应的目标预测帧;
    视频确定模块,用于将各组所述关联参考帧对应的目标预测帧进行插帧处理,得到第二帧序列,基于所述第二帧序列得到播放视频。
  14. 一种电子设备,包括处理器和存储器,所述处理器和存储器相互连接;
    所述存储器用于存储计算机程序;
    所述处理器被配置用于在调用所述计算机程序时,执行如权利要求1至4任一项所述的方法或者权利要求5至11任一项所述的方法。
  15. 一种计算机可读存储介质,其中,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行以实现权利要求1至4任一项所述的方法或者权利要求5至11任一项所述的方法。
PCT/CN2021/142584 2021-07-30 2021-12-29 视频数据处理方法、装置、设备以及存储介质 WO2023005140A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110874693.7A CN113556582A (zh) 2021-07-30 2021-07-30 视频数据处理方法、装置、设备以及存储介质
CN202110874693.7 2021-07-30

Publications (1)

Publication Number Publication Date
WO2023005140A1 true WO2023005140A1 (zh) 2023-02-02

Family

ID=78105050

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/142584 WO2023005140A1 (zh) 2021-07-30 2021-12-29 视频数据处理方法、装置、设备以及存储介质

Country Status (2)

Country Link
CN (1) CN113556582A (zh)
WO (1) WO2023005140A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113556582A (zh) * 2021-07-30 2021-10-26 海宁奕斯伟集成电路设计有限公司 视频数据处理方法、装置、设备以及存储介质
CN114286123B (zh) * 2021-12-23 2024-07-16 海宁奕斯伟集成电路设计有限公司 电视节目的直播方法及装置
CN114449280B (zh) * 2022-03-30 2022-10-04 浙江智慧视频安防创新中心有限公司 一种视频编解码方法、装置及设备
CN117857817A (zh) * 2022-09-30 2024-04-09 中国电信股份有限公司 面向机器视觉的视频数据处理方法及相关设备
CN117315574B (zh) * 2023-09-20 2024-06-07 北京卓视智通科技有限责任公司 一种盲区轨迹补全的方法、系统、计算机设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200068253A1 (en) * 2018-08-23 2020-02-27 Dish Network L.L.C. Automated transition classification for binge watching of content
CN111901598A (zh) * 2020-06-28 2020-11-06 华南理工大学 视频解码与编码的方法、装置、介质及电子设备
CN112866799A (zh) * 2020-12-31 2021-05-28 百果园技术(新加坡)有限公司 一种视频抽帧处理方法、装置、设备及介质
CN113038176A (zh) * 2021-03-19 2021-06-25 北京字跳网络技术有限公司 视频抽帧方法、装置和电子设备
CN113556582A (zh) * 2021-07-30 2021-10-26 海宁奕斯伟集成电路设计有限公司 视频数据处理方法、装置、设备以及存储介质

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR100755689B1 (ko) * 2005-02-14 2007-09-05 삼성전자주식회사 계층적 시간적 필터링 구조를 갖는 비디오 코딩 및 디코딩방법, 이를 위한 장치
CN104618679B (zh) * 2015-03-13 2018-03-27 南京知乎信息科技有限公司 一种监控视频中抽取关键信息帧的方法
CN107027029B (zh) * 2017-03-01 2020-01-10 四川大学 基于帧率变换的高性能视频编码改进方法
US11778195B2 (en) * 2017-07-07 2023-10-03 Kakadu R & D Pty Ltd. Fast, high quality optical flow estimation from coded video
CN109905717A (zh) * 2017-12-11 2019-06-18 四川大学 一种基于空时域下采样与重建的h.264/avc编码优化方法
CN112104830B (zh) * 2020-08-13 2022-09-27 北京迈格威科技有限公司 视频插帧方法、模型训练方法及对应装置
CN112184779A (zh) * 2020-09-17 2021-01-05 无锡安科迪智能技术有限公司 插帧图像处理方法及装置
CN112532998B (zh) * 2020-12-01 2023-02-21 网易传媒科技(北京)有限公司 抽取视频帧的方法、装置、设备和可读存储介质

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200068253A1 (en) * 2018-08-23 2020-02-27 Dish Network L.L.C. Automated transition classification for binge watching of content
CN111901598A (zh) * 2020-06-28 2020-11-06 华南理工大学 视频解码与编码的方法、装置、介质及电子设备
CN112866799A (zh) * 2020-12-31 2021-05-28 百果园技术(新加坡)有限公司 一种视频抽帧处理方法、装置、设备及介质
CN113038176A (zh) * 2021-03-19 2021-06-25 北京字跳网络技术有限公司 视频抽帧方法、装置和电子设备
CN113556582A (zh) * 2021-07-30 2021-10-26 海宁奕斯伟集成电路设计有限公司 视频数据处理方法、装置、设备以及存储介质

Also Published As

Publication number Publication date
CN113556582A (zh) 2021-10-26

Similar Documents

Publication Publication Date Title
WO2023005140A1 (zh) 视频数据处理方法、装置、设备以及存储介质
CN110324664B (zh) 一种基于神经网络的视频补帧方法及其模型的训练方法
CN110751649B (zh) 视频质量评估方法、装置、电子设备及存储介质
CN112991203B (zh) 图像处理方法、装置、电子设备及存储介质
WO2022141819A1 (zh) 视频插帧方法、装置、计算机设备及存储介质
CN111954053B (zh) 获取蒙版帧数据的方法、计算机设备及可读存储介质
WO2019001108A1 (zh) 视频处理的方法和装置
CN110222758B (zh) 一种图像处理方法、装置、设备及存储介质
WO2020220516A1 (zh) 图像生成网络的训练及图像处理方法、装置、电子设备、介质
CN110958469A (zh) 视频处理方法、装置、电子设备及存储介质
CN112584232A (zh) 视频插帧方法、装置及服务器
WO2021227704A1 (zh) 图像识别方法、视频播放方法、相关设备及介质
AU2019201358A1 (en) Real time overlay placement in videos for augmented reality applications
CN111985281A (zh) 图像生成模型的生成方法、装置及图像生成方法、装置
Wu et al. Towards robust text-prompted semantic criterion for in-the-wild video quality assessment
US20240223790A1 (en) Encoding and Decoding Method, and Apparatus
WO2022221205A1 (en) Video super-resolution using deep neural networks
Saha et al. Perceptual video quality assessment: The journey continues!
Agarwal et al. Compressing video calls using synthetic talking heads
CN114390307A (zh) 图像画质增强方法、装置、终端及可读存储介质
JP2023549210A (ja) ビデオフレーム圧縮方法、ビデオフレーム伸長方法及び装置
CN114173137A (zh) 视频编码方法、装置及电子设备
US20220217321A1 (en) Method of training a neural network configured for converting 2d images into 3d models
CN115375539A (zh) 图像分辨率增强、多帧图像超分辨率系统和方法
Lee et al. An image-guided network for depth edge enhancement

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21951712

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21951712

Country of ref document: EP

Kind code of ref document: A1