WO2022242122A1 - Video optimization method and apparatus, terminal device, and storage medium - Google Patents

Video optimization method and apparatus, terminal device, and storage medium

Info

Publication number: WO2022242122A1
Authority: WIPO (PCT)
Prior art keywords: frame, video, feature, video frame, frames
Application number: PCT/CN2021/137583
Other languages: French (fr), Chinese (zh)
Inventors: 刘翼豪, 赵恒远, 董超, 乔宇
Original assignee: 中国科学院深圳先进技术研究院
Application filed by 中国科学院深圳先进技术研究院
Publication of WO2022242122A1


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/90 Dynamic range modification of images or parts thereof
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06T 7/00 Image analysis
    • G06T 7/90 Determination of colour characteristics
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Definitions

  • the present application relates to the technical field of deep learning, and in particular to a video optimization method, device, terminal equipment and storage medium.
  • Video optimization generally includes operations such as video denoising, video deraining, video super-resolution, video color grading, and black-and-white video colorization.
  • in current deep-learning-based video optimization schemes, image optimization models (such as image denoising models, image deraining models, super-resolution models, image color-grading models, and black-and-white image colorization models) are often used to extract the intermediate features of each video frame and to perform feature estimation on those intermediate features, yielding an optimized image for each frame and thereby optimizing the video.
  • however, optimizing each video frame independently with an image optimization model in this way may give different frames different optimization effects, harming the continuity of the optimized video.
  • in view of this, the present application provides a video optimization method, apparatus, terminal device, and storage medium, so as to improve the continuity of the optimized video.
  • the present application provides a video optimization method, including:
  • extracting, with a trained feature extraction network, the intermediate features of the M anchor frames in a video frame sequence to be optimized, where the video frame sequence includes N video frames, the M anchor frames include the 1st and the N-th video frame of the sequence, and M is a positive integer greater than 2 and less than N; determining, with a trained optical flow network, the forward and reverse optical flow parameters of each of the N-M intermediate frames, where the forward optical flow parameters of an intermediate frame describe the transformation from its previous frame to that intermediate frame, the reverse optical flow parameters describe the transformation from its subsequent frame to that intermediate frame, and the intermediate frames are the video frames of the video to be optimized other than the anchor frames; determining the intermediate features of the N-M intermediate frames according to their forward and reverse optical flow parameters and the intermediate features of the M anchor frames; and performing feature estimation, with a trained feature estimation network, on the intermediate features of each frame of the sequence to obtain N optimized images, which constitute the optimized video of the sequence.
  • determining the intermediate features of the N-M intermediate frames includes:
  • for the i-th video frame in the sequence, i ∈ {1, 2, ..., N-1, N}, when the i-th frame is an intermediate frame: warping the intermediate features of the (i-1)-th frame with the forward optical flow parameters of the i-th frame to obtain the forward features of the i-th frame; warping the reverse features of the (i+1)-th frame with the reverse optical flow parameters of the i-th frame to obtain the reverse features of the i-th frame; and fusing the forward and reverse features of the i-th frame to obtain its intermediate features; where, if the (i+1)-th frame is an anchor frame, the reverse features of the (i+1)-th frame take the value of its intermediate features.
  • fusing the forward and reverse features of the i-th video frame to obtain its intermediate features includes:
  • inputting the (i-1)-th, i-th, and (i+1)-th video frames, the forward feature of the i-th frame, the reverse feature of the i-th frame, the forward feature of the (i-1)-th frame, and the reverse feature of the (i+1)-th frame into a trained FFM model for fusion, obtaining the intermediate features of the i-th video frame; where, if the (i-1)-th frame is an anchor frame, the forward features of the (i-1)-th frame take the value of its intermediate features.
  • fusion processing includes:
  • obtaining the merged features of the (i-1)-th, i-th, and (i+1)-th video frames; performing weight estimation on the merged features and the forward and reverse features of the i-th frame to obtain a weight matrix; weighting the forward and reverse features of the i-th frame with the weight matrix to obtain weighted features; performing convolution on the weighted features, the merged features, the forward features of the (i-1)-th frame, and the reverse features of the (i+1)-th frame to obtain supplementary features; and superimposing the supplementary features on the weighted features to obtain the intermediate features of the i-th video frame.
  • the method also includes:
  • constructing an initial video optimization model that includes an initial feature extraction network, an initial optical flow network, an initial feature estimation network, and an initial FFM model; and performing unsupervised training on the initial model with a preset loss function and a training set to obtain the trained feature extraction network, optical flow network, feature estimation network, and FFM model; where the training set includes multiple video frame sequence samples to be optimized.
  • the feature extraction network and the feature estimation network are obtained by splitting a preset image optimization model, and the image optimization model is used to perform image optimization on a two-dimensional image.
  • when the image optimization model is an image colorization model, the video frame sequence includes N grayscale frames; for the i-th grayscale frame in the sequence, i ∈ {1, 2, ..., N-1, N}, using the feature estimation network to perform feature estimation on the intermediate features of the i-th grayscale frame to obtain its optimized image includes: performing color estimation on the intermediate features to obtain the corresponding a-channel and b-channel images, and constructing from the grayscale frame and the a- and b-channel images the color image of the i-th frame in the Lab domain, that color image being the optimized image.
  • the present application provides a video optimization device, including:
  • an extraction unit, configured to use the trained feature extraction network to extract the intermediate features of the M anchor frames in the video frame sequence to be optimized, where the sequence includes N video frames, the M anchor frames include the 1st and the N-th video frame of the sequence, and M is a positive integer greater than 2 and less than N;
  • a determining unit, configured to use the trained optical flow network to determine the forward and reverse optical flow parameters of the N-M intermediate frames, where the forward optical flow parameters of an intermediate frame describe the transformation from its previous frame to that frame, the reverse optical flow parameters describe the transformation from its subsequent frame to that frame, and the intermediate frames are the video frames of the video to be optimized other than the anchor frames;
  • the determining unit is further configured to determine the intermediate features of the N-M intermediate frames according to their forward and reverse optical flow parameters and the intermediate features of the M anchor frames;
  • an estimation unit, configured to use the trained feature estimation network to perform feature estimation on the intermediate features of the N video frames of the sequence to obtain N optimized images, which constitute the optimized video of the sequence.
  • the present application provides a terminal device, including: a memory and a processor, where the memory is used to store a computer program; and the processor is used to execute the method described in any one of the above first aspects when calling the computer program.
  • the present application provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method described in any one of the above-mentioned first aspects is implemented.
  • an embodiment of the present application provides a computer program product, which, when the computer program product runs on a processor, causes the processor to execute the method described in any one of the above-mentioned first aspects.
  • based on the video optimization method, apparatus, terminal device, and storage medium provided by the present application, anchor frames are extracted from the video frame sequence to be optimized, and their intermediate features are extracted with the feature extraction network.
  • for the intermediate frames located between anchor frames, the optical flow network computes, for each intermediate frame, the optical flow parameters with respect to its two adjacent frames (that is, the forward optical flow parameters describing the transformation from the previous frame to the intermediate frame, and the reverse optical flow parameters describing the transformation from the subsequent frame to the intermediate frame).
  • the intermediate features of the intermediate frames are then calculated using the optical flow parameters and the intermediate features of the anchor frames located before and after the intermediate frame.
  • the intermediate features of the intermediate frames are thus obtained by forward-propagating and back-propagating the anchor frames' intermediate features between the intermediate frames, so they retain the frame-to-frame transformation information; the optimized video obtained after feature estimation on each frame's intermediate features therefore has improved continuity.
  • FIG. 1 is a schematic diagram of the network structure of a video optimization model provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a video optimization method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a network structure of an FFM model provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a video optimization device provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the image optimization model is often directly used to individually optimize each frame in the video to achieve video optimization.
  • This method of independently optimizing each video frame in the video based on the image optimization model may cause different video frames to have different optimization effects, affecting the continuity of the optimized video.
  • to address this problem, the present application provides a video optimization method: after the intermediate features of the anchor frames in the video frame sequence to be optimized are extracted, they are forward-propagated and back-propagated through the intermediate frames (the video frames located between anchor frames) to compute the intermediate features of those frames.
  • the intermediate features of the intermediate frames thus retain the transformation information between frames, so the optimized video obtained after feature estimation on each frame's intermediate features preserves continuity to a certain extent.
  • a video optimization model provided by this application is exemplarily introduced with reference to FIG. 1 .
  • the video optimization model is deployed in a video processing device, and the video processing device can process a sequence of video frames to be optimized based on the video optimization model, so as to implement the video optimization method provided in this application.
  • the video processing device may be a mobile terminal device such as a smart phone, a tablet computer, or a video camera, or may be a terminal device capable of processing video data such as a desktop computer, a robot, or a server.
  • the video optimization model provided by the present application includes a feature extraction network G_E, an optical flow network (FlowNet), and a feature estimation network G_C.
  • the feature extraction network is used to extract the intermediate features of the input image, and the size of the intermediate features matches the input size required by the feature estimation network.
  • the feature estimation network is used to perform feature estimation on the input intermediate features (including feature mapping, feature reconstruction, etc.), and the output is an optimized image.
  • the feature extraction network and the feature estimation network can be obtained by splitting an image optimization model for image optimization on 2D images.
  • the feature extraction network and feature estimation network are obtained by splitting the image coloring model.
  • the image coloring model can be any network model capable of automatically coloring black and white images, for example, Pix2Pix model, colornet.t7 model, colornet_imagenet.t7 model, etc.
  • an image colorization model generally extracts the intermediate features of the input grayscale image (i.e., a black-and-white image, which can be regarded as the L-channel image in the Lab domain) through successive convolutional, activation, and/or pooling layers; performs color mapping or color reconstruction on the final intermediate features to obtain an a-channel image and a b-channel image; and finally constructs, from the a-channel image, the b-channel image, and the input grayscale image, the color image in the Lab domain corresponding to the grayscale input.
  • the sub-network whose input is a grayscale image and whose output is an intermediate feature is defined as a feature extraction network;
  • the sub-network whose input is an intermediate feature and whose output is a color image is defined as a feature estimation network.
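As an illustration of this splitting, here is a minimal sketch assuming a toy colorization network; `ColorizationNet` and its layer sizes are illustrative stand-ins, not any of the specific models named above:

```python
import torch.nn as nn

class ColorizationNet(nn.Module):
    """Toy stand-in for an image colorization model: extracts intermediate
    features from the L channel, then estimates the a/b channels."""
    def __init__(self, feat_ch=64):
        super().__init__()
        # Feature extraction sub-network (G_E): L-channel image -> features.
        self.feature_extraction = nn.Sequential(
            nn.Conv2d(1, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # Feature estimation sub-network (G_C): features -> a/b channels.
        self.feature_estimation = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, 2, 3, padding=1),
        )

    def forward(self, l_channel):
        return self.feature_estimation(self.feature_extraction(l_channel))

# "Splitting" the image optimization model into the two sub-networks
# used by the video optimization model:
model = ColorizationNet()
G_E = model.feature_extraction   # grayscale image -> intermediate features
G_C = model.feature_estimation   # intermediate features -> a/b channels
```

The same split applies to the super-resolution case below: everything up to the final intermediate features plays the role of G_E, and the reconstruction head plays the role of G_C.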
  • the feature extraction network and feature estimation network are obtained by splitting the super-resolution model.
  • the super-resolution model can be any network model capable of mapping low-resolution images to high-resolution images, for example, FSRCNN model, CARN model, SRResNet model, RCAN model, etc.
  • a super-resolution model generally extracts the intermediate features of the input low-resolution image through successive convolutional, residual, pooling, and/or deconvolution layers, and then upsamples the final intermediate features (i.e., performs image reconstruction) to obtain the corresponding high-resolution image.
  • the sub-network whose input is a low-resolution image and whose output is an intermediate feature is defined as a feature extraction network
  • the sub-network whose input is an intermediate feature and whose output is a high-resolution image is defined as a feature estimation network.
  • video optimization scenarios such as video rain removal, video defogging, and video color adjustment may also be included.
  • the optical flow network is used to estimate the optical flow parameters of two adjacent video frames, that is, the displacement of the same object from one video frame to the other, which describes the transformation relationship from one frame to the other.
  • FlowNet2.0 may be used as the optical flow network in this application.
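As a sketch of how such a network would be driven under the scheme above, assuming a pretrained FlowNet2-style estimator `flow_net` that maps a frame pair (src, dst) to a dense flow describing the src-to-dst transformation (the name and signature are assumptions, not a specific library API):

```python
import torch

def compute_flows(frames, flow_net):
    """Compute forward and reverse optical flow for each intermediate frame.

    frames: list of tensors of shape (1, 3, H, W); frames[0] and frames[-1]
    are anchor frames, the rest are intermediate frames. flow_net is assumed
    to map (src, dst) to a dense flow of shape (1, 2, H, W) describing the
    transformation from src to dst.
    """
    forward_flows, reverse_flows = {}, {}
    for i in range(1, len(frames) - 1):
        # Forward flow of frame i: previous frame -> intermediate frame i.
        forward_flows[i] = flow_net(frames[i - 1], frames[i])
        # Reverse flow of frame i: next frame -> intermediate frame i.
        reverse_flows[i] = flow_net(frames[i + 1], frames[i])
    return forward_flows, reverse_flows
```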
  • the video processing device obtains the video frame sequence to be optimized and, after determining the M anchor frames and N-M intermediate frames in the sequence, inputs the sequence into the trained video optimization model for processing to obtain the optimized video.
  • the video frame sequence to be optimized may be a video segment cut out from a video, or a complete video.
  • the video frame sequence includes N frames of video.
  • among the N video frames there are M anchor frames, which include the 1st video frame and the N-th video frame, where M is a positive integer greater than 2 and less than N.
  • the M anchor frames may be designated manually, or identified by the video processing device from the N video frames according to a preset anchor frame extraction rule. For example, if the interval is set to 10 intermediate frames, the video processing device can identify the 1st video frame as the first anchor frame, then, after an interval of 10 intermediate frames, identify the 12th video frame as the second anchor frame, and so on until the N-th video frame is identified as the M-th anchor frame. Understandably, the number of intermediate frames between the M-th and the (M-1)-th anchor frames may be fewer than 10 (a sketch of this rule follows the example below).
  • an intermediate frame is a video frame located between two adjacent anchor frames among the N video frames; for example, if the 1st and the 12th video frames are two adjacent anchor frames, the 2nd through 11th video frames between them are intermediate frames.
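A minimal sketch of the anchor-selection rule just described (0-indexed, so the 1st and 12th frames of the example are indices 0 and 11; the helper name is illustrative):

```python
def pick_anchor_indices(n_frames, interval=10):
    """Pick anchor frame indices: the first frame, then one anchor after
    every `interval` intermediate frames, and always the last frame.
    The last gap may contain fewer than `interval` intermediate frames."""
    assert n_frames >= 2
    anchors = list(range(0, n_frames - 1, interval + 1))
    if anchors[-1] != n_frames - 1:
        anchors.append(n_frames - 1)
    return anchors

# pick_anchor_indices(30) -> [0, 11, 22, 29]: frames 1, 12, 23, and 30
# (1-indexed) are anchors; the final gap holds fewer than 10 intermediates.
```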
  • the video processing device performs video optimization on the video frame sequence to be optimized based on the video optimization model, as shown in FIG. 2 , including:
  • suppose the 1st video frame x_1 and the 4th video frame x_4 are anchor frames, and the 2nd video frame x_2 and the 3rd video frame x_3 are intermediate frames.
  • the video processing device inputs x_1 and x_4 into the feature extraction network G_E for processing to obtain the intermediate feature F_1 of x_1 and the intermediate feature F_4 of x_4.
  • the forward optical flow parameters of an intermediate frame describe the transformation from its previous frame to the intermediate frame, and the reverse optical flow parameters describe the transformation from its subsequent frame to the intermediate frame.
  • the video processing device inputs x_1 and x_2 into the optical flow network to obtain the forward optical flow parameter f_{1→2} of x_2 (describing the transformation from x_1 to x_2).
  • the video processing device inputs x_2 and x_3 into the optical flow network to obtain the forward optical flow parameter f_{2→3} of x_3 (describing the transformation from x_2 to x_3), and inputs x_3 and x_2 to obtain the reverse optical flow parameter f_{3→2} of x_2 (describing the transformation from x_3 to x_2).
  • the video processing device inputs x_4 and x_3 into the optical flow network to obtain the reverse optical flow parameter f_{4→3} of x_3 (describing the transformation from x_4 to x_3).
  • the optical flow parameters are used to propagate the intermediate features of the two anchor frames through the intermediate frames; that is, the optical flow parameters between each intermediate frame and its two adjacent frames are computed by the optical flow network, and based on these parameters the anchor frames' intermediate features are propagated forward or backward frame by frame, so that the intermediate features of the intermediate frames are aligned with those of the anchor frames.
  • for the i-th video frame, i ∈ {1, 2, ..., N-1, N}, when the i-th frame is an intermediate frame:
  • the video processing device can warp the intermediate features of the (i-1)-th frame with the forward optical flow parameters of the i-th frame to obtain the forward features of the i-th frame; warp the reverse features of the (i+1)-th frame with the reverse optical flow parameters of the i-th frame to obtain the reverse features of the i-th frame; and fuse the forward and reverse features of the i-th frame to obtain its intermediate features.
  • the intermediate features of an anchor frame also serve as its reverse and forward features; that is, the intermediate, reverse, and forward features of an anchor frame share the same value. In other words, if the (i+1)-th frame is an anchor frame, the reverse features of the (i+1)-th frame take the value of the intermediate features extracted by the feature extraction network.
  • the intermediate feature F_4 of x_4 is back-propagated to obtain the reverse features of x_3 and x_2. That is, the reverse optical flow parameter f_{4→3} of x_3 is used to perform a shape change (warp) operation on F_4, obtaining the reverse feature F_3^b of x_3; after F_3^b is obtained, the reverse optical flow parameter f_{3→2} of x_2 is used to warp F_3^b, obtaining the reverse feature F_2^b of x_2.
  • the intermediate feature F_1 of x_1 is forward-propagated to obtain the intermediate features of x_2 and x_3. That is, the forward optical flow parameter f_{1→2} of x_2 is used to warp F_1, obtaining the forward feature F_2^f of x_2; F_2^f and F_2^b are then fused to obtain the intermediate feature F_2 of x_2. After F_2 is obtained, the forward optical flow parameter f_{2→3} of x_3 is used to warp F_2, obtaining the forward feature F_3^f of x_3; F_3^f and F_3^b are then fused to obtain the intermediate feature F_3 of x_3 (a sketch of the warp operation follows).
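The warp operation itself can be sketched as follows in PyTorch; bilinear grid sampling and the pixel-displacement flow convention are assumptions, since the text does not pin down the exact sampling scheme:

```python
import torch
import torch.nn.functional as F

def warp(features, flow):
    """Warp a feature map with a dense optical flow field.

    features: (B, C, H, W); flow: (B, 2, H, W), where flow[:, 0] / flow[:, 1]
    hold the horizontal / vertical displacements in pixels.
    """
    b, _, h, w = features.shape
    # Base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().to(features.device)  # (2, H, W)
    coords = grid.unsqueeze(0) + flow                                # (B, 2, H, W)
    # Normalize coordinates to [-1, 1] as required by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    sample_grid = torch.stack((coords_x, coords_y), dim=-1)          # (B, H, W, 2)
    return F.grid_sample(features, sample_grid, align_corners=True)

# The propagation in the example above then reads:
# F3_b = warp(F_4, f_4to3); F2_b = warp(F3_b, f_3to2)   # backward pass
# F2_f = warp(F_1, f_1to2); F_2 = fuse(F2_f, F2_b)      # forward pass
# F3_f = warp(F_2, f_2to3); F_3 = fuse(F3_f, F3_b)
```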
  • that is, the intermediate features of one anchor frame are first propagated backward, and the intermediate features of the other anchor frame are then propagated forward.
  • this two-way transmission of information compensates for the information loss caused by the optical flow network and the warp operation in a single transmission direction, and improves the temporal continuity of each frame's intermediate features, which benefits the subsequent video optimization effect.
  • because the intermediate features of an intermediate frame are computed from the intermediate features of the anchor frames on both sides of it, when a scene change occurs in the video frame sequence, its influence is confined to that time interval (i.e., between the two anchor frames) and does not affect the accuracy of the intermediate features of intermediate frames in other time intervals.
  • when fusing the forward and reverse features of the i-th video frame, the fusion may be performed by numerical calculation, or a feature fusion network may be set in the video optimization model to perform the fusion.
  • the feature fusion network may be a conventional feature fusion network with field-aware capabilities, for example, a field-aware factorization machine (FFM) or a factorization machine (FM).
  • this embodiment of the present application provides an improved FFM model: the (i-1)-th video frame x_{i-1}, the i-th video frame x_i, the (i+1)-th video frame x_{i+1}, the forward feature F_i^f of the i-th frame, the reverse feature F_i^b of the i-th frame, the forward feature F_{i-1}^f of the (i-1)-th frame, and the reverse feature F_{i+1}^b of the (i+1)-th frame are input, a feature fusion operation is performed, and the intermediate feature F_i of the i-th video frame is output. That is, as shown in FIG. 1, the video optimization model provided in the embodiment of the present application also includes the FFM model provided in the embodiment of the present application.
  • a convolutional layer is used to extract features from x_{i-1}, x_i, and x_{i+1} respectively; the extracted features are then concatenated (concat) to obtain a merged feature, which is fed into the weight estimation network and the feature compensation network respectively, each of which is composed of multiple convolutional layers.
  • the weight estimation network takes the merged feature, F_i^f, and F_i^b as input and, after multi-layer convolution, outputs the weight matrix W.
  • weighting F_i^f and F_i^b with W selects, pixel by pixel, between F_i^f and F_i^b, yielding a weighted feature (for example, of the form W ⊙ F_i^f + (1 - W) ⊙ F_i^b).
  • the feature compensation network takes the merged feature, the weighted feature, F_{i-1}^f, and F_{i+1}^b as input and, after multi-layer convolution, outputs the supplementary feature corresponding to the weighted feature.
  • the supplementary feature restores the information lost to the optical flow network and the warp operation during the computation of F_i^f and F_i^b.
  • superimposing the supplementary feature on the weighted feature yields the intermediate feature F_i of the i-th video frame.
  • the FFM model provided by this application constructs the intermediate feature F_i of the i-th video frame with reference not only to the i-th frame's own features but also to the (i-1)-th frame x_{i-1}, the (i+1)-th frame x_{i+1}, the forward feature of the (i-1)-th frame, and the reverse feature of the (i+1)-th frame. That is, information from the preceding and following frames is considered, so F_i is temporally more continuous with the intermediate features of its neighbors, while the information lost to the optical flow network and the warp operation is supplemented. The continuity of the intermediate features of the intermediate frames is therefore further improved.
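A hedged PyTorch sketch of such a fusion module is given below. The layer counts, channel widths, and the sigmoid-gated per-pixel selection between forward and reverse features are assumptions consistent with the description above, not the patent's exact architecture:

```python
import torch
import torch.nn as nn

class FFM(nn.Module):
    """Sketch of the fusion module described above: a weight estimation
    branch selects per pixel between forward and reverse features, and a
    compensation branch adds back information lost to flow and warping."""
    def __init__(self, feat_ch=64, frame_ch=3, hidden=64):
        super().__init__()
        # Extract and merge features of the three input frames.
        self.frame_conv = nn.Conv2d(3 * frame_ch, hidden, 3, padding=1)
        # Weight estimation network: merged frame features + F_i^f + F_i^b.
        self.weight_net = nn.Sequential(
            nn.Conv2d(hidden + 2 * feat_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, feat_ch, 3, padding=1), nn.Sigmoid(),
        )
        # Feature compensation network: merged frame features + weighted
        # feature + F_{i-1}^f + F_{i+1}^b.
        self.comp_net = nn.Sequential(
            nn.Conv2d(hidden + 3 * feat_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, feat_ch, 3, padding=1),
        )

    def forward(self, x_prev, x_cur, x_next, f_fwd, f_bwd, f_prev_fwd, f_next_bwd):
        merged = self.frame_conv(torch.cat((x_prev, x_cur, x_next), dim=1))
        w = self.weight_net(torch.cat((merged, f_fwd, f_bwd), dim=1))
        weighted = w * f_fwd + (1.0 - w) * f_bwd        # per-pixel selection
        supp = self.comp_net(
            torch.cat((merged, weighted, f_prev_fwd, f_next_bwd), dim=1))
        return weighted + supp                           # intermediate feature F_i
```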
  • the feature estimation network is used to perform feature estimation on the intermediate features of the i-th grayscale frame; obtaining the optimized image of the i-th grayscale frame includes performing color estimation on its intermediate features to obtain the corresponding a-channel and b-channel images, and constructing from them and the grayscale frame the color image in the Lab domain.
  • in summary, anchor frames are selected and, after their intermediate features are extracted, those features are forward-propagated and back-propagated through the intermediate frames to compute the intermediate features of the intermediate frames.
  • the intermediate features of the intermediate frames thereby retain the inter-frame transformation information, so the optimized video obtained after feature estimation on each frame's intermediate features preserves continuity to a certain extent.
  • an initial video optimization model is constructed, which includes an initial feature extraction network, an initial optical flow network, and an initial feature estimation network.
  • a corresponding image optimization model can be selected based on a specific video optimization scenario. Then the image optimization model is split to obtain the corresponding feature extraction initial network and feature estimation initial network.
  • the feature fusion network can also be set in the initial model of video optimization.
  • the improved FFM initial model provided by the present application may be set in the video optimization initial model.
  • the trained video optimization model includes the above-mentioned trained feature extraction network, optical flow network, feature estimation network, and FFM model.
  • the training set includes a plurality of video frame sequence samples to be optimized; since unsupervised training is adopted, the training set does not need to include corresponding ground-truth (e.g., color) video frame sequences.
  • the loss function can be designed according to the actual video optimization scenario. For example, for the black-and-white video colorization scenario, the loss function can be designed as a temporal consistency loss over frames separated by a fixed interval (see the sketch below), where M is the occlusion matrix, N is the number of frames of the video frame sequence samples, and d is the interval between adjacent frames.
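The formula itself is not reproduced in this text. One plausible form of a temporal-warping consistency loss matching these definitions, offered as an assumption rather than the patent's exact formula, is:

$$
\mathcal{L} = \frac{1}{N-d} \sum_{i=1}^{N-d} \left\| M_{i \rightarrow i+d} \odot \big( \hat{y}_{i+d} - \mathrm{warp}(\hat{y}_i, f_{i \rightarrow i+d}) \big) \right\|_1
$$

where $\hat{y}_i$ is the optimized (e.g., colorized) output for frame $i$, $f_{i \rightarrow i+d}$ is the optical flow from frame $i$ to frame $i+d$, $M$ masks occluded regions where the flow is unreliable, $N$ is the number of frames of the video frame sequence samples, and $d$ is the interval between the compared frames.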
  • a gradient descent algorithm may be used during training.
  • the parameters of the network are learned through iteration.
  • the initial learning rate can be set to 1e-4, and every 50,000 iterations, the learning rate is decayed by half until the network converges.
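A sketch of this training schedule, where the optimizer choice (Adam), `video_opt_model`, `temporal_loss`, and `train_loader` are illustrative placeholders rather than names from the patent:

```python
import torch

optimizer = torch.optim.Adam(video_opt_model.parameters(), lr=1e-4)
# Halve the learning rate every 50,000 iterations.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50_000, gamma=0.5)

max_iters = 200_000  # stand-in for "until the network converges"
data_iter = iter(train_loader)
for step in range(max_iters):
    try:
        frames = next(data_iter)          # one video frame sequence sample
    except StopIteration:
        data_iter = iter(train_loader)    # restart the epoch
        frames = next(data_iter)
    optimizer.zero_grad()
    outputs = video_opt_model(frames)     # optimized frames for the sample
    loss = temporal_loss(outputs, frames) # unsupervised, no ground truth
    loss.backward()
    optimizer.step()
    scheduler.step()                      # per-iteration LR decay
```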
  • the video optimization model and training method provided by this application are general-purpose and can be applied to any video optimization task, or to any task that uses the video optimization effect as an evaluation metric.
  • the embodiment of the present application provides a video optimization device.
  • this device embodiment corresponds to the foregoing method embodiment. For brevity, the details are not described one by one here, but it should be clear that the apparatus in this embodiment can correspondingly implement all the content of the foregoing method embodiments.
  • FIG. 4 is a schematic structural diagram of a video optimization device provided by an embodiment of the present application.
  • the video optimization device provided by this embodiment includes: an extraction unit 401 , a determination unit 402 and an estimation unit 403 .
  • the extraction unit 401 is configured to use the trained feature extraction network to extract the intermediate features of the M anchor frames in the video frame sequence to be optimized, where the sequence includes N video frames, the M anchor frames include the 1st and the N-th video frame of the sequence, and M is a positive integer greater than 2 and less than N.
  • the determination unit 402 is configured to use the trained optical flow network to determine the forward and reverse optical flow parameters of the N-M intermediate frames, where the forward optical flow parameters of an intermediate frame describe the transformation from its previous frame to that frame, the reverse optical flow parameters describe the transformation from its subsequent frame to that frame, and the intermediate frames are the video frames of the video to be optimized other than the anchor frames.
  • the determination unit 402 is further configured to determine the intermediate features of the N-M intermediate frames according to the forward optical flow parameters and reverse optical flow parameters of the N-M intermediate frames, and the intermediate features of the M anchor frames.
  • the estimation unit 403 is configured to use the trained feature estimation network to perform feature estimation processing on the intermediate features of N frames of the video frame sequence to obtain N frames of optimized images, and the N frames of optimized images constitute an optimized video of the video frame sequence.
  • the video optimization device provided in this embodiment can execute the above-mentioned method embodiment, and its implementation principle and technical effect are similar, and details are not repeated here.
  • FIG. 5 is a schematic structural diagram of a terminal device provided in an embodiment of the present application.
  • the terminal device provided in this embodiment includes: a memory 501 and a processor 502, the memory 501 is used to store computer programs; the processor 502 is used to The methods described in the above method embodiments are executed when the computer program is called.
  • the computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 501 and executed by the processor 502 to complete this Apply the method described in the examples.
  • the one or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program in the terminal device.
  • FIG. 5 is only an example of a terminal device and does not constitute a limitation on the terminal device; it may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal device may also include an input/output device, a network access device, a bus, and the like.
  • the processor 502 can be a central processing unit (Central Processing Unit, CPU), and can also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Field-Programmable Gate Array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the storage 501 may be an internal storage unit of the terminal device, such as a hard disk or memory of the terminal device.
  • the memory 501 may also be an external storage device of the terminal device, such as a plug-in hard disk equipped on the terminal device, a smart media card (SMC), a secure digital (SD) card, a flash card, etc.
  • the memory 501 may also include both an internal storage unit of the terminal device and an external storage device.
  • the memory 501 is used to store the computer program and other programs and data required by the terminal device.
  • the memory 501 can also be used to temporarily store data that has been output or will be output.
  • the terminal device provided in this embodiment can execute the foregoing method embodiment, and its implementation principle and technical effect are similar, and details are not repeated here.
  • the embodiment of the present application also provides a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, the method described in the foregoing method embodiment is implemented.
  • the embodiment of the present application further provides a computer program product, which, when the computer program product runs on a terminal device, enables the terminal device to implement the method described in the foregoing method embodiments when executed.
  • if the above-mentioned integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application may be completed by instructing related hardware through computer programs, and the computer programs may be stored in a computer-readable storage medium.
  • when the computer program is executed by a processor, the steps in the above-mentioned various method embodiments can be realized.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form.
  • the computer-readable storage medium may at least include: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk.
  • in some jurisdictions, under legislation and patent practice, computer-readable media may not include electrical carrier signals or telecommunication signals.
  • the disclosed device/device and method can be implemented in other ways.
  • the device/device embodiments described above are only illustrative.
  • the division of the modules or units is only a logical function division.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the term “if” may be construed, depending on the context, as “when”, “once”, “in response to determining”, or “in response to detecting”.
  • the phrase “if determined” or “if [the described condition or event] is detected” may be construed, depending on the context, to mean “once determined”, “in response to determining”, “once [the described condition or event] is detected”, or “in response to detecting [the described condition or event]”.
  • references to "one embodiment” or “some embodiments” or the like in the specification of the present application means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the present application.
  • appearances of the phrases “in one embodiment,” “in some embodiments,” “in other embodiments,” and the like in various places in this specification do not necessarily all refer to the same embodiment, but mean “one or more but not all embodiments” unless specifically stated otherwise.
  • the terms “including”, “comprising”, “having” and variations thereof mean “including but not limited to”, unless specifically stated otherwise.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

A video optimization method and apparatus, a terminal device, and a storage medium, relating to the technical field of deep learning and capable of improving the continuity of optimized video. The video optimization method comprises: using a trained feature extraction network to extract the intermediate features of M anchor frames in a video frame sequence to be optimized (S201), the sequence comprising N video frames and the M anchor frames comprising the 1st and the N-th video frame of the sequence; using a trained optical flow network to determine the forward and reverse optical flow parameters of each of the N-M intermediate frames (S202); determining the intermediate features of the N-M intermediate frames according to their forward and reverse optical flow parameters and the intermediate features of the M anchor frames (S203); and using a trained feature estimation network to perform feature estimation on the intermediate features of the N video frames of the sequence to obtain N optimized images, which constitute the optimized video of the sequence (S204).

Description

A video optimization method, apparatus, terminal device, and storage medium

Technical Field

The present application relates to the technical field of deep learning, and in particular to a video optimization method, apparatus, terminal device, and storage medium.

Background

Video optimization generally includes operations such as video denoising, video deraining, video super-resolution, video color grading, and black-and-white video colorization. At present, deep-learning-based video optimization schemes often use image optimization models (such as image denoising models, image deraining models, super-resolution models, image color-grading models, and black-and-white image colorization models) to extract the intermediate features of each video frame and to perform feature estimation on those intermediate features, obtaining an optimized image for each frame and thereby optimizing the video.

However, optimizing each video frame independently with an image optimization model in this way may give different frames different optimization effects, harming the continuity of the optimized video.
Summary

In view of this, the present application provides a video optimization method, apparatus, terminal device, and storage medium, so as to improve the continuity of the optimized video.

In a first aspect, the present application provides a video optimization method, including:

using a trained feature extraction network to extract the intermediate features of the M anchor frames in a video frame sequence to be optimized, where the video frame sequence includes N video frames, the M anchor frames include the 1st and the N-th video frame of the sequence, and M is a positive integer greater than 2 and less than N; using a trained optical flow network to determine the forward and reverse optical flow parameters of the N-M intermediate frames, where the forward optical flow parameters of an intermediate frame describe the transformation from its previous frame to that intermediate frame, the reverse optical flow parameters describe the transformation from its subsequent frame to that intermediate frame, and the intermediate frames are the video frames of the video to be optimized other than the anchor frames; determining the intermediate features of the N-M intermediate frames according to their forward and reverse optical flow parameters and the intermediate features of the M anchor frames; and using a trained feature estimation network to perform feature estimation on the intermediate features of each frame of the sequence to obtain N optimized images, which constitute the optimized video of the sequence.
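To make the data flow of the first aspect concrete, the following is a compact sketch of the claimed pipeline; `warp` and `fuse` stand in for the shape transformation and feature fusion steps defined in the optional implementations below, and all helper names are illustrative:

```python
def optimize_video(frames, G_E, flow_net, G_C, anchor_idx, warp, fuse):
    """Sketch of the claimed pipeline: extract anchor features, propagate
    them through intermediate frames via optical flow, then run feature
    estimation on every frame. Helper names are placeholders."""
    feats = {i: G_E(frames[i]) for i in anchor_idx}          # anchor features

    # Process each span between two adjacent anchors.
    for a, b in zip(anchor_idx[:-1], anchor_idx[1:]):
        # Backward pass: propagate the right anchor's features leftward.
        bwd = {b: feats[b]}
        for i in range(b - 1, a, -1):
            bwd[i] = warp(bwd[i + 1], flow_net(frames[i + 1], frames[i]))
        # Forward pass: propagate the left anchor's features rightward,
        # fusing with the backward features at each intermediate frame.
        prev = feats[a]
        for i in range(a + 1, b):
            fwd = warp(prev, flow_net(frames[i - 1], frames[i]))
            feats[i] = fuse(fwd, bwd[i])
            prev = feats[i]

    return [G_C(feats[i]) for i in range(len(frames))]       # optimized frames
```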
In an optional implementation, determining the intermediate features of the N-M intermediate frames according to their forward and reverse optical flow parameters and the intermediate features of the M anchor frames includes:

for the i-th video frame in the sequence, i ∈ {1, 2, ..., N-1, N}, when the i-th frame is an intermediate frame: warping the intermediate features of the (i-1)-th frame with the forward optical flow parameters of the i-th frame to obtain the forward features of the i-th frame; warping the reverse features of the (i+1)-th frame with the reverse optical flow parameters of the i-th frame to obtain the reverse features of the i-th frame; and fusing the forward and reverse features of the i-th frame to obtain its intermediate features; where, if the (i+1)-th frame is an anchor frame, the reverse features of the (i+1)-th frame take the value of its intermediate features.
In an optional implementation, fusing the forward and reverse features of the i-th video frame to obtain its intermediate features includes:

inputting the (i-1)-th, i-th, and (i+1)-th video frames, the forward feature of the i-th frame, the reverse feature of the i-th frame, the forward feature of the (i-1)-th frame, and the reverse feature of the (i+1)-th frame into a trained FFM model for fusion, obtaining the intermediate features of the i-th frame; where, if the (i-1)-th frame is an anchor frame, the forward features of the (i-1)-th frame take the value of its intermediate features.
In an optional implementation, the fusion processing includes:

obtaining the merged features of the (i-1)-th, i-th, and (i+1)-th video frames; performing weight estimation on the merged features and the forward and reverse features of the i-th frame to obtain a weight matrix; weighting the forward and reverse features of the i-th frame with the weight matrix to obtain weighted features; performing convolution on the weighted features, the merged features, the forward features of the (i-1)-th frame, and the reverse features of the (i+1)-th frame to obtain supplementary features; and superimposing the supplementary features on the weighted features to obtain the intermediate features of the i-th video frame.
In an optional implementation, the method further includes:

constructing an initial video optimization model that includes an initial feature extraction network, an initial optical flow network, an initial feature estimation network, and an initial FFM model; and performing unsupervised training on the initial model with a preset loss function and a training set to obtain the trained feature extraction network, optical flow network, feature estimation network, and FFM model; where the training set includes multiple video frame sequence samples to be optimized.
In an optional implementation, the feature extraction network and the feature estimation network are obtained by splitting a preset image optimization model, and the image optimization model is used to perform image optimization on two-dimensional images.
In an optional implementation, the image optimization model is an image colorization model, and the video frame sequence includes N grayscale frames; for the i-th grayscale frame in the sequence, i ∈ {1, 2, ..., N-1, N}, using the feature estimation network to perform feature estimation on the intermediate features of the i-th grayscale frame to obtain its optimized image includes:

performing color estimation on the intermediate features of the i-th grayscale frame to obtain the corresponding a-channel image and b-channel image; and obtaining, from the i-th grayscale frame, the a-channel image, and the b-channel image, the color image of the i-th frame in the Lab domain, that color image being the optimized image of the i-th grayscale frame.
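As an illustration of this last step, here is a minimal sketch that treats the grayscale frame as the Lab L channel and uses scikit-image for the final Lab-to-RGB conversion; the channel value ranges are assumptions following the CIE Lab convention:

```python
import numpy as np
from skimage import color

def assemble_lab(l_gray, a_chan, b_chan):
    """Assemble the optimized color image from the input grayscale frame
    (treated as the L channel) and the estimated a/b channels, then convert
    Lab -> RGB for display. Assumes l_gray in [0, 100] and a_chan/b_chan
    in roughly [-128, 127]."""
    lab = np.stack((l_gray, a_chan, b_chan), axis=-1)  # (H, W, 3) Lab image
    return color.lab2rgb(lab)                          # (H, W, 3) RGB in [0, 1]
```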
In a second aspect, the present application provides a video optimization apparatus, including:

an extraction unit, configured to use the trained feature extraction network to extract the intermediate features of the M anchor frames in the video frame sequence to be optimized, where the sequence includes N video frames, the M anchor frames include the 1st and the N-th video frame of the sequence, and M is a positive integer greater than 2 and less than N;

a determining unit, configured to use the trained optical flow network to determine the forward and reverse optical flow parameters of the N-M intermediate frames, where the forward optical flow parameters of an intermediate frame describe the transformation from its previous frame to that frame, the reverse optical flow parameters describe the transformation from its subsequent frame to that frame, and the intermediate frames are the video frames of the video to be optimized other than the anchor frames;

the determining unit being further configured to determine the intermediate features of the N-M intermediate frames according to their forward and reverse optical flow parameters and the intermediate features of the M anchor frames; and

an estimation unit, configured to use the trained feature estimation network to perform feature estimation on the intermediate features of the N video frames of the sequence to obtain N optimized images, which constitute the optimized video of the sequence.
In a third aspect, the present application provides a terminal device, including a memory and a processor, where the memory is configured to store a computer program, and the processor is configured to execute the method described in any one of the implementations of the first aspect when invoking the computer program.

In a fourth aspect, the present application provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the method described in any one of the implementations of the first aspect is implemented.

In a fifth aspect, an embodiment of the present application provides a computer program product which, when run on a processor, causes the processor to execute the method described in any one of the implementations of the first aspect.
Based on the video optimization method, apparatus, terminal device, and storage medium provided by the present application, anchor frames are extracted from the video frame sequence to be optimized, and their intermediate features are extracted with the feature extraction network. For the intermediate frames located between anchor frames, the optical flow network computes, for each intermediate frame, the optical flow parameters with respect to its two adjacent frames (that is, the forward optical flow parameters describing the transformation from the previous frame to the intermediate frame, and the reverse optical flow parameters describing the transformation from the subsequent frame to the intermediate frame). The intermediate features of the intermediate frames are then computed from the optical flow parameters and the intermediate features of the anchor frames before and after them; in other words, the anchor frames' intermediate features are forward-propagated and back-propagated through the intermediate frames. The intermediate features of the intermediate frames therefore retain the frame-to-frame transformation information, so the optimized video obtained after feature estimation on each frame's intermediate features has improved continuity.
Description of Drawings
FIG. 1 is a schematic diagram of the network structure of a video optimization model provided by an embodiment of the present application;
FIG. 2 is a schematic flowchart of a video optimization method provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the network structure of an FFM model provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a video optimization apparatus provided by an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
Detailed Description
At present, deep-learning-based video optimization algorithms often apply an image optimization model directly to each frame of a video, optimizing the frames one by one. Optimizing every video frame independently with an image optimization model may give different frames different optimization results, impairing the continuity of the optimized video.
To address this problem, the present application provides a video optimization method: after the intermediate features of the anchor frames in the video frame sequence to be optimized are extracted, the intermediate features of the anchor frames are propagated forward and backward across the intermediate frames (the video frames located between anchor frames) to compute the intermediate features of the intermediate frames. The intermediate features of the intermediate frames thereby retain the frame-to-frame transformation information, so that the optimized video obtained by feature estimation on the intermediate features of each frame is guaranteed a certain degree of continuity.
The technical solution of the present application is described in detail below with specific embodiments. The following specific embodiments may be combined with one another, and the same or similar concepts or processes may not be repeated in some embodiments.
First, a video optimization model provided by the present application is introduced by way of example with reference to FIG. 1. The video optimization model is deployed in a video processing device, and the video processing device can process the video frame sequence to be optimized based on the video optimization model, so as to implement the video optimization method provided in the present application. The video processing device may be a mobile terminal device such as a smartphone, a tablet computer or a video camera, or a terminal device capable of processing video data such as a desktop computer, a robot or a server.
Exemplarily, as shown in FIG. 1, the video optimization model provided by the present application includes a feature extraction network G_E, an optical flow network (FlowNet) and a feature estimation network G_C.
The feature extraction network is used to extract the intermediate features of an input image, where the size of the intermediate features matches the input size required by the feature estimation network. The feature estimation network is used to perform feature estimation (including feature mapping, feature reconstruction, etc.) on the input intermediate features and to output an optimized image.
In one example, the feature extraction network and the feature estimation network may be obtained by splitting an image optimization model used for optimizing two-dimensional images.
For example, when the video optimization model is applied to the video optimization scenario of colorizing black-and-white video, the feature extraction network and the feature estimation network are obtained by splitting an image colorization model. The image colorization model may be any network model capable of automatically colorizing black-and-white images, for example, the Pix2Pix model, the colornet.t7 model, the colornet_imagenet.t7 model, etc.
An image colorization model generally extracts the intermediate features of an input grayscale image (i.e., a black-and-white image, which can be regarded as the L-channel image in the Lab domain) through successive network layers such as convolutional layers, activation layers and/or pooling layers, then performs color mapping or color reconstruction on the finally extracted intermediate features to obtain an a-channel image and a b-channel image, and finally constructs, from the a-channel image, the b-channel image and the input grayscale image, the color image corresponding to the grayscale image in the Lab domain.
When splitting such a network model, the split may be made at any intermediate layer before the network layers that output the a-channel image and the b-channel image, giving two sub-networks: the sub-network whose input is the grayscale image and whose output is the intermediate features is defined as the feature extraction network, and the sub-network whose input is the intermediate features and whose output is the color image is defined as the feature estimation network.
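As an illustration of this splitting, the following is a minimal PyTorch-style sketch. The layer structure, the split point and the names (ColorizationNet, split_model) are hypothetical stand-ins for whatever colorization model is actually used; only the idea of cutting the model into a feature extractor G_E and a feature estimator G_C comes from the description above.

```python
import torch
import torch.nn as nn

# Hypothetical colorization model: L channel in, ab channels out.
class ColorizationNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(              # layers before the split point
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(              # layers after the split point
            nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 2, 3, padding=1),     # predicted a and b channels
        )

    def forward(self, l_channel):
        return self.head(self.body(l_channel))

def split_model(model: ColorizationNet):
    # G_E: image -> intermediate features; G_C: intermediate features -> output.
    return model.body, model.head               # (G_E, G_C)

g_e, g_c = split_model(ColorizationNet())
x = torch.randn(1, 1, 64, 64)                   # one grayscale frame
features = g_e(x)                               # intermediate features
ab = g_c(features)                              # estimated ab channels
```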
For another example, when the video optimization model is applied to the video optimization scenario of video super-resolution (i.e., converting low-resolution video into high-resolution video), the feature extraction network and the feature estimation network are obtained by splitting a super-resolution model. The super-resolution model may be any network model capable of mapping low-resolution images to high-resolution images, for example, the FSRCNN model, the CARN model, the SRResNet model, the RCAN model, etc.
A super-resolution model generally extracts the intermediate features of an input low-resolution image through successive network layers such as convolutional layers, residual layers, pooling layers and/or deconvolution layers, and then upsamples the finally extracted intermediate features (i.e., performs image reconstruction) to obtain the corresponding high-resolution image. When splitting such a network model, the split may be made at any intermediate layer before the upsampling layer, giving two sub-networks: the sub-network whose input is the low-resolution image and whose output is the intermediate features is defined as the feature extraction network, and the sub-network whose input is the intermediate features and whose output is the high-resolution image is defined as the feature estimation network.
It can be understood that, for other video optimization scenarios besides the above video colorization and video super-resolution, such as video deraining, video dehazing and video color grading, the image optimization model of the corresponding scenario can likewise be split directly to build the video optimization model. These are not enumerated here one by one.
The optical flow network is used to estimate the optical flow parameters of two adjacent video frames, i.e., the displacement of the same object from one video frame to the other, which describes the transformation from one video frame to the other. Exemplarily, FlowNet2.0 may be used as the optical flow network in the present application.
Based on the above video optimization model, after the video processing device obtains the video frame sequence to be optimized and determines the M anchor frames and the N-M intermediate frames in the video frame sequence, the video frame sequence can be input into the trained video optimization model for processing to obtain the optimized video.
The video frame sequence to be optimized may be a clip cut out of a video, or a complete video. Assume the video frame sequence includes N video frames, among which there are M anchor frames including the 1st video frame and the N-th video frame, where M is a positive integer greater than 2 and less than N.
The M anchor frames may be designated manually, or identified by the video processing device from the N video frames according to a preset anchor frame extraction rule. For example, if the number of intermediate frames between anchors is set to 10, the video processing device may start from the 1st video frame, identify the 1st video frame as the 1st anchor frame, and after an interval of 10 intermediate frames identify the 12th video frame as the 2nd anchor frame, and so on, until the N-th video frame is identified as the M-th anchor frame. It can be understood that the number of intermediate frames between the M-th anchor frame and the (M-1)-th anchor frame may be less than 10. As the name implies, an intermediate frame is a video frame located between two adjacent anchor frames among the N video frames; for example, if the 1st video frame and the 12th video frame are two adjacent anchor frames, the 2nd to 11th video frames located between them are intermediate frames. A minimal sketch of such an extraction rule is given below.
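The sketch below implements the fixed-gap rule from the example above (10 intermediate frames between anchors); the function name is hypothetical, and 0-based frame indices are used where the text counts frames from 1.

```python
def select_anchor_indices(n_frames: int, gap: int = 10) -> list[int]:
    """Pick anchor frame indices (0-based): frame 0, then every gap + 1
    frames, always including the last frame (whose gap may be shorter)."""
    anchors = list(range(0, n_frames, gap + 1))
    if anchors[-1] != n_frames - 1:
        anchors.append(n_frames - 1)   # the final frame is always an anchor
    return anchors

# For a 30-frame sequence: anchors at frames 0, 11, 22 and 29.
print(select_anchor_indices(30))       # [0, 11, 22, 29]
```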
Exemplarily, the process by which the video processing device performs video optimization on the video frame sequence to be optimized based on the video optimization model may be as shown in FIG. 2, and includes:
S201: using the trained feature extraction network to extract the intermediate features of the M anchor frames respectively.
For example, take the four consecutive video frames x_1, x_2, x_3, x_4 shown in FIG. 1, where the 1st video frame x_1 and the 4th video frame x_4 are anchor frames, and the 2nd video frame x_2 and the 3rd video frame x_3 are intermediate frames. The video processing device inputs x_1 and x_4 separately into the feature extraction network G_E for processing, obtaining the intermediate feature F_1 of x_1 and the intermediate feature F_4 of x_4.
S202: using the trained optical flow network to determine the forward optical flow parameters and backward optical flow parameters of the N-M intermediate frames respectively.
The forward optical flow parameter of an intermediate frame describes the transformation from the frame preceding the intermediate frame to the intermediate frame, and the backward optical flow parameter of an intermediate frame describes the transformation from the frame following the intermediate frame to the intermediate frame.
For example, as shown in FIG. 1, for the intermediate frame x_2, the video processing device inputs x_1 and x_2 into the optical flow network to obtain the forward optical flow parameter f_{1→2} of x_2 (describing the transformation from x_1 to x_2), and inputs x_3 and x_2 into the optical flow network to obtain the backward optical flow parameter f_{3→2} of x_2 (describing the transformation from x_3 to x_2). For the intermediate frame x_3, the video processing device inputs x_2 and x_3 into the optical flow network to obtain the forward optical flow parameter f_{2→3} of x_3 (describing the transformation from x_2 to x_3), and inputs x_4 and x_3 into the optical flow network to obtain the backward optical flow parameter f_{4→3} of x_3 (describing the transformation from x_4 to x_3).
S203: determining the intermediate features of the N-M intermediate frames according to the forward optical flow parameters and backward optical flow parameters of the N-M intermediate frames and the intermediate features of the M anchor frames.
In the embodiments of the present application, for the intermediate frames located between two adjacent anchor frames, the optical flow parameters are used to propagate the intermediate features of the two anchor frames across the intermediate frames. That is, the optical flow parameters between each intermediate frame and its two adjacent frames are computed by the optical flow network, and based on each optical flow parameter the intermediate features of the anchor frames are propagated forward or backward frame by frame, so that the intermediate features of the intermediate frames are aligned with the intermediate features of the anchor frames.
Exemplarily, for the i-th video frame in the video frame sequence, where i takes values in {1, 2, ..., N-1, N}, when the i-th video frame is an intermediate frame:
the video processing device may perform a shape transformation on the intermediate feature of the (i-1)-th video frame using the forward optical flow parameter of the i-th video frame, obtaining the forward feature of the i-th video frame; perform a shape transformation on the backward feature of the (i+1)-th video frame using the backward optical flow parameter of the i-th video frame, obtaining the backward feature of the i-th video frame; and perform feature fusion on the forward feature and the backward feature of the i-th video frame, obtaining the intermediate feature of the i-th video frame.
It is worth noting that the intermediate feature of an anchor frame also serves as both the backward feature and the forward feature of that anchor frame; that is, the intermediate feature, backward feature and forward feature of an anchor frame take the same value. In other words, if the (i+1)-th video frame is an anchor frame, the backward feature of the (i+1)-th video frame takes the value of the intermediate feature of the (i+1)-th video frame extracted by the feature extraction network.
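The shape transformation (warp) mentioned above can be implemented by resampling a feature map along the optical flow. A minimal sketch using PyTorch's grid_sample follows; the convention that flow[:, 0] and flow[:, 1] hold horizontal and vertical pixel displacements is an assumption of this illustration, not something the application prescribes.

```python
import torch
import torch.nn.functional as F

def warp(feat: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Resample a feature map feat (B, C, H, W) along an optical flow
    field flow (B, 2, H, W), e.g. warp(F_prev, f_prev_to_i) aligns the
    previous frame's features to frame i."""
    _, _, h, w = feat.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=feat.device),
        torch.arange(w, device=feat.device),
        indexing="ij",
    )
    base = torch.stack((xs, ys)).float()       # (2, H, W), x coordinate first
    coords = base.unsqueeze(0) + flow          # displaced sampling positions
    # Normalize coordinates to [-1, 1] as grid_sample requires.
    gx = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=3)        # (B, H, W, 2)
    return F.grid_sample(feat, grid, align_corners=True)
```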
For example, taking x_1, x_2, x_3 and x_4 shown in FIG. 1 as an example, the following describes how the intermediate features of the anchor frames x_1 and x_4 are propagated backward and forward across the intermediate frames x_2 and x_3 to obtain the intermediate features of x_2 and x_3.
As shown in FIG. 1, the intermediate feature F_4 of x_4 is first propagated backward to obtain the backward features of x_3 and x_2. That is, a shape transformation (warp) operation is performed on F_4 using the backward optical flow parameter f_{4→3} of x_3, obtaining the backward feature of x_3, denoted F̄_3 = warp(F_4, f_{4→3}). Afterwards, the backward optical flow parameter f_{3→2} of x_2 is used to perform a warp operation on F̄_3, obtaining the backward feature of x_2, denoted F̄_2 = warp(F̄_3, f_{3→2}).
Then, based on the backward features of x_2 and x_3, the intermediate feature F_1 of x_1 is propagated forward to obtain the intermediate features of x_2 and x_3. That is, a warp operation is performed on F_1 using the forward optical flow parameter f_{1→2} of x_2, obtaining the forward feature of x_2, denoted F̃_2 = warp(F_1, f_{1→2}); F̃_2 and F̄_2 are then fused to obtain the intermediate feature F_2 of x_2. After F_2 is obtained, a warp operation is performed on F_2 using the forward optical flow parameter f_{2→3} of x_3, obtaining the forward feature of x_3, denoted F̃_3 = warp(F_2, f_{2→3}); F̃_3 and F̄_3 are then fused to obtain the intermediate feature F_3 of x_3.
It can be seen that, when computing the intermediate features of the intermediate frames, the intermediate feature of one anchor frame is first propagated backward, and the intermediate feature of the other anchor frame is then propagated forward. This bidirectional transmission of information compensates for the information loss that the optical flow network and the warp operations introduce in a single transmission direction, and improves the temporal continuity of the intermediate features across frames, which in turn benefits the subsequent video optimization.
In addition, since the intermediate features of an intermediate frame are computed from the intermediate features of the anchor frames on its two sides, when the video frame sequence contains scene changes, the disturbance introduced by each scene switch exists only within that time interval (i.e., between two anchor frames) and does not affect the accuracy of the intermediate features of the intermediate frames in other time intervals.
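Pulling the above together, a sketch of the per-segment bidirectional propagation might look as follows. It reuses the warp function sketched earlier and leaves the fusion step abstract, with fuse standing in for either a numerical fusion or the FFM model described below; the flow indexing is an assumption of this illustration.

```python
def propagate_segment(feat_anchor_left, feat_anchor_right,
                      fwd_flows, bwd_flows, fuse):
    """Compute intermediate features for the frames strictly between
    two anchors.

    fwd_flows[k]: forward flow into intermediate frame k (from its predecessor)
    bwd_flows[k]: backward flow into intermediate frame k (from its successor)
    """
    n = len(fwd_flows)                   # number of intermediate frames
    # Backward pass: propagate the right anchor's feature to every frame.
    bwd = [None] * n
    prev = feat_anchor_right
    for k in reversed(range(n)):
        prev = warp(prev, bwd_flows[k])  # e.g. F̄_3 = warp(F_4, f_{4→3})
        bwd[k] = prev
    # Forward pass: propagate the left anchor's feature and fuse per frame.
    out = []
    prev = feat_anchor_left
    for k in range(n):
        fwd_k = warp(prev, fwd_flows[k]) # e.g. F̃_2 = warp(F_1, f_{1→2})
        prev = fuse(fwd_k, bwd[k])       # intermediate feature of frame k
        out.append(prev)
    return out
```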
When performing feature fusion on the forward feature and the backward feature of the i-th video frame, the fusion may be done by numerical computation, or a feature fusion network may be provided in the video optimization model to perform the fusion. The feature fusion network may be a conventional feature fusion network, for example, a field-aware factorization machine (FFM), a factorization machine (FM), or the like.
Optionally, an embodiment of the present application provides an improved FFM model: it takes as input the (i-1)-th video frame x_{i-1}, the i-th video frame x_i, the (i+1)-th video frame x_{i+1}, the forward feature F̃_i of the i-th video frame, the backward feature F̄_i of the i-th video frame, the forward feature F̃_{i-1} of the (i-1)-th video frame and the backward feature F̄_{i+1} of the (i+1)-th video frame, performs the feature fusion operation, and outputs the intermediate feature F_i of the i-th video frame. That is, as shown in FIG. 1, the video optimization model provided by the embodiments of the present application further includes the FFM model provided by the embodiments of the present application.
Exemplarily, taking as an example the fusion of the forward feature F̃_i and the backward feature F̄_i of the i-th video frame to obtain the intermediate feature F_i of the i-th video frame, the network structure of the FFM model provided by the present application may be as shown in FIG. 3.
First, feature extraction is performed on x_{i-1}, x_i and x_{i+1}; for example, a convolutional layer is applied to each of x_{i-1}, x_i and x_{i+1}. The extracted features are concatenated (concat) into a merged feature, which is then input separately into a weighting network and a feature refine network (feature compensation network).
The weighting network and the feature refine network are each composed of multiple convolutional layers. The weighting network applies multi-layer convolution operations to the input merged feature, F̃_i and F̄_i, and outputs a weight matrix W. Weighting F̃_i and F̄_i with W selects, pixel by pixel, between F̃_i and F̄_i, giving a fused feature F̂_i (for example, F̂_i = W ⊙ F̃_i + (1 - W) ⊙ F̄_i, where ⊙ denotes element-wise multiplication).
After a 1×1 convolution operation, F̂_i is input into the feature refine network. The feature refine network applies multi-layer convolution operations to the input merged feature, the convolved F̂_i, the forward feature F̃_{i-1} of the (i-1)-th video frame and the backward feature F̄_{i+1} of the (i+1)-th video frame, and outputs a supplementary feature corresponding to F̂_i. This supplementary feature can restore the information that is lost in the computation of F̃_i and F̄_i due to the optical flow network and the warp operations. Superimposing the supplementary feature and F̂_i yields the intermediate feature F_i of the i-th video frame.
It is worth noting that, when constructing F_i of the i-th video frame, the FFM model provided by the present application refers not only to the (i-1)-th video frame x_{i-1} and the (i+1)-th video frame x_{i+1}, but also to the forward feature F̃_{i-1} of the (i-1)-th video frame and the backward feature F̄_{i+1} of the (i+1)-th video frame. That is, the information of the preceding and succeeding frames is taken into account, so that F_i of the i-th video frame is temporally more continuous with the intermediate features of the neighboring frames. At the same time, the information lost due to the optical flow network and the warp operations can be supplemented from F̃_{i-1} and F̄_{i+1}. Therefore, the continuity of the intermediate features of the intermediate frames can be further improved.
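As a rough sketch of this structure, the module below implements the weighting branch and the feature refine branch as just described. The channel counts, layer depths, sigmoid-normalized weights and the exact weighted combination are assumptions filled in for illustration, since the application fixes the design only at the level of FIG. 3.

```python
import torch
import torch.nn as nn

class FFM(nn.Module):
    """Sketch of the fusion module: a weighting branch that blends the
    forward and backward features, and a refine branch that predicts a
    supplementary (compensation) feature added back onto the blend."""
    def __init__(self, c_feat: int = 64, c_img: int = 3):
        super().__init__()
        self.frame_conv = nn.Conv2d(3 * c_img, c_feat, 3, padding=1)
        self.weighting = nn.Sequential(
            nn.Conv2d(3 * c_feat, c_feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c_feat, c_feat, 3, padding=1), nn.Sigmoid(),  # W in [0, 1]
        )
        self.bottleneck = nn.Conv2d(c_feat, c_feat, 1)   # 1x1 conv on fused feature
        self.refine = nn.Sequential(
            nn.Conv2d(4 * c_feat, c_feat, 3, padding=1), nn.ReLU(),
            nn.Conv2d(c_feat, c_feat, 3, padding=1),
        )

    def forward(self, frames, f_fwd, f_bwd, f_fwd_prev, f_bwd_next):
        merged = self.frame_conv(torch.cat(frames, dim=1))  # x_{i-1}, x_i, x_{i+1}
        w = self.weighting(torch.cat([merged, f_fwd, f_bwd], dim=1))
        fused = w * f_fwd + (1 - w) * f_bwd                 # pixel-wise selection
        z = self.bottleneck(fused)
        delta = self.refine(torch.cat([merged, z, f_fwd_prev, f_bwd_next], dim=1))
        return fused + delta                                # intermediate feature F_i

# Usage with dummy tensors:
ffm = FFM()
frames = [torch.randn(1, 3, 32, 32) for _ in range(3)]
feats = [torch.randn(1, 64, 32, 32) for _ in range(4)]
f_i = ffm(frames, *feats)                                   # (1, 64, 32, 32)
```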
S204: using the trained feature estimation network to perform feature estimation on the intermediate features of each video frame of the video frame sequence, obtaining N optimized images, where the N optimized images constitute the optimized video of the video frame sequence.
Take the video optimization scenario of colorizing black-and-white video as an example. For the i-th grayscale frame in the video frame sequence, performing feature estimation on the intermediate features of the i-th grayscale frame with the feature estimation network to obtain the optimized image of the i-th grayscale frame includes:
performing color estimation on the intermediate features of the i-th grayscale frame to obtain output information ŷ_i, where ŷ_i includes the a-channel image and the b-channel image corresponding to the i-th grayscale frame; and obtaining, from the i-th grayscale frame, the a-channel image and the b-channel image, the color image of the i-th grayscale frame in the Lab domain, this color image being the optimized image of the i-th grayscale frame.
For example, in FIG. 1, after the intermediate features of x_1, x_2, x_3 and x_4 are obtained, they are input separately into the feature estimation network G_C for processing, which outputs the output information ŷ_1, ŷ_2, ŷ_3 and ŷ_4 corresponding to x_1, x_2, x_3 and x_4 respectively. Since FIG. 1 takes the video optimization scenario of colorizing black-and-white video as an example, each output ŷ_i includes an a-channel image and a b-channel image. Combining the a-channel image, the b-channel image and the grayscale frame yields the color images corresponding to x_1, x_2, x_3 and x_4 respectively.
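Recomposing the Lab color image from the input L channel and the predicted a/b channels can be sketched as follows; the value ranges and the use of scikit-image's lab2rgb for conversion to a displayable RGB image are assumptions of this illustration.

```python
import numpy as np
from skimage.color import lab2rgb

def compose_lab(l_channel: np.ndarray, ab: np.ndarray) -> np.ndarray:
    """Stack the input L channel (H, W), assumed in [0, 100], with the
    predicted a and b channels (H, W, 2) into a Lab image, then convert
    it to RGB for display."""
    lab = np.concatenate([l_channel[..., None], ab], axis=-1)  # (H, W, 3)
    return lab2rgb(lab)                                        # float RGB in [0, 1]

# Dummy usage: a mid-gray L channel with zero chroma stays gray in RGB.
l = np.full((4, 4), 50.0)
ab = np.zeros((4, 4, 2))
rgb = compose_lab(l, ab)
```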
In summary, with the video optimization method provided in the present application, the intermediate features of each video frame are no longer extracted independently; instead, anchor frames are selected, and after the intermediate features of the anchor frames are extracted, they are propagated forward and backward across the intermediate frames to compute the intermediate features of the intermediate frames. The intermediate features of the intermediate frames thereby retain the frame-to-frame transformation information, so that the optimized video obtained by feature estimation on the intermediate features of each frame is guaranteed a certain degree of continuity.
The training process of the video optimization model provided by the present application is described below by way of example.
First, an initial video optimization model is constructed, including an initial feature extraction network, an initial optical flow network and an initial feature estimation network.
It can be understood that, when constructing the initial video optimization model, the corresponding image optimization model can be selected based on the specific video optimization scenario, and the image optimization model is then split to obtain the corresponding initial feature extraction network and initial feature estimation network.
In addition, if a feature fusion network is used to fuse the forward and backward features, a feature fusion network may also be provided in the initial video optimization model; for example, the initial model of the improved FFM provided by the present application may be provided in the initial video optimization model.
Afterwards, unsupervised training is performed on the initial video optimization model using a preset loss function and a training set, obtaining the trained video optimization model. Correspondingly, the trained video optimization model includes the above trained feature extraction network, optical flow network, feature estimation network and FFM model.
In the embodiments of the present application, the training set includes multiple samples of video frame sequences to be optimized. Since unsupervised training is adopted, the training set does not need to include corresponding color video frame sequences.
The loss function can be designed based on the actual video optimization scenario. For example, taking the video optimization scenario of colorizing black-and-white video as an example, the loss function can be designed as:
L = Σ_{d∈{1,2}} Σ_{i=1}^{N-d} ‖ M ⊙ ( ŷ_i − warp(ŷ_{i+d}, f_{i+d→i}) ) ‖
where M is an occlusion matrix, N is the number of frames of the video frame sequence sample, and d is the interval between frames: d = 1 denotes two adjacent frames, and d = 2 denotes two frames separated by one frame. ŷ_i denotes the output information of the i-th video frame sample, and warp(ŷ_{i+d}, f_{i+d→i}) denotes the result of transforming the output of the (i+d)-th video frame sample towards the i-th video frame sample through a warp operation. Since ŷ_i and the warped ŷ_{i+d} are required to remain consistent in content, a consistency (loss) constraint can be imposed based on this loss function.
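A sketch of a loss of this form follows, reusing the warp function sketched earlier; the L1-style penalty and the way flows and occlusion matrices are indexed are assumptions, since the rendered formula in the source fixes only the overall structure.

```python
import torch

def temporal_consistency_loss(outputs, flows_to_prev, masks, intervals=(1, 2)):
    """outputs: list of N output tensors (B, C, H, W).
    flows_to_prev[(i + d, i)]: flow warping frame i+d's output towards frame i.
    masks[(i + d, i)]: occlusion matrix M for that frame pair."""
    n, loss = len(outputs), 0.0
    for d in intervals:
        for i in range(n - d):
            warped = warp(outputs[i + d], flows_to_prev[(i + d, i)])
            diff = masks[(i + d, i)] * (outputs[i] - warped)
            loss = loss + diff.abs().mean()   # L1-style consistency penalty
    return loss
```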
Exemplarily, a gradient descent algorithm can be used during training, and the network parameters are learned iteratively. For example, the initial learning rate can be set to 1e-4, and the learning rate is halved every 50000 iterations until the network converges.
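This schedule corresponds to a step decay of the learning rate, sketched below with PyTorch's StepLR; the choice of the Adam optimizer and the dummy loss are assumptions for illustration.

```python
import torch

# params stands in for the trainable parameters of the video optimization model.
params = [torch.nn.Parameter(torch.randn(4, 4))]
optimizer = torch.optim.Adam(params, lr=1e-4)
# Halve the learning rate every 50000 iterations (scheduler stepped per iteration).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50000, gamma=0.5)

for step in range(100):                    # in practice, run until convergence
    optimizer.zero_grad()
    loss = (params[0] ** 2).mean()         # placeholder for the training loss
    loss.backward()
    optimizer.step()
    scheduler.step()
```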
It is worth noting that the video optimization model and training method provided by the present application are general-purpose: they can be applied to any video optimization task, or to any task that uses video optimization quality as an evaluation metric.
Based on the same inventive concept, as an implementation of the above method, an embodiment of the present application provides a video optimization apparatus. This apparatus embodiment corresponds to the foregoing method embodiment; for ease of reading, the details of the foregoing method embodiment are not repeated one by one here, but it should be clear that the apparatus in this embodiment can correspondingly implement all the content of the foregoing method embodiment.
FIG. 4 is a schematic structural diagram of the video optimization apparatus provided by an embodiment of the present application. As shown in FIG. 4, the video optimization apparatus provided by this embodiment includes an extraction unit 401, a determination unit 402 and an estimation unit 403.
The extraction unit 401 is configured to use the trained feature extraction network to extract the intermediate features of the M anchor frames in the video frame sequence to be optimized, where the video frame sequence includes N video frames, the M anchor frames include the 1st video frame and the N-th video frame of the video frame sequence, and M is a positive integer greater than 2 and less than N.
The determination unit 402 is configured to use the trained optical flow network to determine the forward optical flow parameters and backward optical flow parameters of the N-M intermediate frames, where the forward optical flow parameter of an intermediate frame describes the transformation from the frame preceding the intermediate frame to the intermediate frame, the backward optical flow parameter of an intermediate frame describes the transformation from the frame following the intermediate frame to the intermediate frame, and the intermediate frames are the video frames in the video to be optimized other than the anchor frames.
The determination unit 402 is further configured to determine the intermediate features of the N-M intermediate frames according to the forward optical flow parameters and backward optical flow parameters of the N-M intermediate frames and the intermediate features of the M anchor frames.
The estimation unit 403 is configured to use the trained feature estimation network to perform feature estimation processing on the intermediate features of the N video frames of the video frame sequence, obtaining N optimized images, where the N optimized images constitute the optimized video of the video frame sequence.
The video optimization apparatus provided by this embodiment can execute the above method embodiment; its implementation principle and technical effect are similar and are not repeated here.
Those skilled in the art can clearly understand that, for convenience and brevity of description, only the division into the above functional units and modules is used as an example. In practical applications, the above functions can be allocated to different functional units and modules as needed; that is, the internal structure of the apparatus is divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for the convenience of distinguishing them from one another and are not used to limit the protection scope of the present application. For the specific working process of the units and modules in the above system, reference may be made to the corresponding process in the foregoing method embodiments, which is not repeated here.
Based on the same inventive concept, an embodiment of the present application further provides a terminal device. FIG. 5 is a schematic structural diagram of the terminal device provided by an embodiment of the present application. As shown in FIG. 5, the terminal device provided by this embodiment includes a memory 501 and a processor 502, where the memory 501 is configured to store a computer program, and the processor 502 is configured to execute the method described in the above method embodiments when invoking the computer program.
Exemplarily, the computer program may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 501 and executed by the processor 502 to complete the method described in the embodiments of the present application. The one or more modules/units may be a series of computer program instruction segments capable of accomplishing specific functions, and the instruction segments are used to describe the execution process of the computer program in the terminal device.
Those skilled in the art can understand that FIG. 5 is only an example of the terminal device and does not constitute a limitation on the terminal device; the terminal device may include more or fewer components than shown, or combine certain components, or use different components. For example, the terminal device may further include an input/output device, a network access device, a bus, and the like.
The processor 502 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
The memory 501 may be an internal storage unit of the terminal device, for example a hard disk or internal memory of the terminal device. The memory 501 may also be an external storage device of the terminal device, for example a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card or a flash card provided on the terminal device. Further, the memory 501 may include both an internal storage unit of the terminal device and an external storage device. The memory 501 is configured to store the computer program and other programs and data required by the terminal device. The memory 501 may also be configured to temporarily store data that has been output or is to be output.
The terminal device provided by this embodiment can execute the above method embodiments; its implementation principle and technical effect are similar and are not repeated here.
An embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, the method described in the above method embodiments is implemented.
An embodiment of the present application further provides a computer program product; when the computer program product runs on a terminal device, it causes the terminal device to implement the method described in the above method embodiments when executed.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments of the present application can be completed by instructing relevant hardware through a computer program. The computer program can be stored in a computer-readable storage medium, and when executed by a processor, it can implement the steps of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable storage medium may include at least: any entity or apparatus capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk or an optical disk. In some jurisdictions, according to legislation and patent practice, computer-readable media may not be electrical carrier signals and telecommunication signals.
In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not detailed or recorded in a certain embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or by a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementation should not be considered beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/device and method may be implemented in other ways. For example, the apparatus/device embodiments described above are merely illustrative; for example, the division into modules or units is only a division by logical function, and there may be other divisions in actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling or direct coupling or communication connection shown or discussed may be realized through some interfaces, and the indirect coupling or communication connection between apparatuses or units may be electrical, mechanical or in other forms.
It should be understood that, when used in the specification of the present application and the appended claims, the term "comprising" indicates the presence of the described features, integers, steps, operations, elements and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or collections thereof.
It should also be understood that the term "and/or" used in the specification of the present application and the appended claims refers to any combination and all possible combinations of one or more of the associated listed items, and includes these combinations.
As used in the specification of the present application and the appended claims, the term "if" may be interpreted, depending on the context, as "when" or "once" or "in response to determining" or "in response to detecting". Similarly, the phrase "if it is determined" or "if [the described condition or event] is detected" may be interpreted, depending on the context, as meaning "once it is determined" or "in response to determining" or "once [the described condition or event] is detected" or "in response to detecting [the described condition or event]".
In addition, in the description of the specification and the appended claims of the present application, the terms "first", "second", "third" and so on are only used to distinguish the descriptions and cannot be understood as indicating or implying relative importance.
Reference in the specification of the present application to "one embodiment" or "some embodiments" or the like means that a particular feature, structure or characteristic described in connection with the embodiment is included in one or more embodiments of the present application. Thus, the phrases "in one embodiment", "in some embodiments", "in some other embodiments", "in still other embodiments" and the like appearing in various places in this specification do not necessarily all refer to the same embodiment, but mean "one or more but not all embodiments", unless otherwise specifically emphasized. The terms "including", "comprising", "having" and their variations all mean "including but not limited to", unless otherwise specifically emphasized.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments, or equivalently replace some or all of the technical features therein, and these modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

Claims (10)

  1. A video optimization method, characterized in that the method comprises:
    using a trained feature extraction network to extract respectively the intermediate features of M anchor frames in a video frame sequence to be optimized, wherein the video frame sequence comprises N video frames, the M anchor frames comprise the 1st video frame and the N-th video frame of the video frame sequence, and M is a positive integer greater than 2 and less than N;
    using a trained optical flow network to determine respectively the forward optical flow parameters and backward optical flow parameters of N-M intermediate frames, wherein the forward optical flow parameter of an intermediate frame describes the transformation from the frame preceding the intermediate frame to the intermediate frame, the backward optical flow parameter of an intermediate frame describes the transformation from the frame following the intermediate frame to the intermediate frame, and the intermediate frames are the video frames in the video to be optimized other than the anchor frames;
    determining the intermediate features of the N-M intermediate frames according to the forward optical flow parameters and backward optical flow parameters of the N-M intermediate frames and the intermediate features of the M anchor frames; and
    using a trained feature estimation network to perform feature estimation respectively on the intermediate features of each video frame of the video frame sequence, obtaining N optimized images, wherein the N optimized images constitute an optimized video of the video frame sequence.
  2. The method according to claim 1, characterized in that the determining the intermediate features of the N-M intermediate frames according to the forward optical flow parameters and backward optical flow parameters of the N-M intermediate frames and the intermediate features of the M anchor frames comprises:
    for the i-th video frame in the video frame sequence, i taking values in {1, 2, ..., N-1, N}, when the i-th video frame is an intermediate frame:
    performing a shape transformation on the intermediate feature of the (i-1)-th video frame using the forward optical flow parameter of the i-th video frame, obtaining the forward feature of the i-th video frame;
    performing a shape transformation on the backward feature of the (i+1)-th video frame using the backward optical flow parameter of the i-th video frame, obtaining the backward feature of the i-th video frame; and
    performing feature fusion on the forward feature of the i-th video frame and the backward feature of the i-th video frame, obtaining the intermediate feature of the i-th video frame;
    wherein, if the (i+1)-th video frame is an anchor frame, the backward feature of the (i+1)-th video frame takes the value of the intermediate feature of the (i+1)-th video frame.
  3. The method according to claim 2, characterized in that the performing feature fusion on the forward feature of the i-th video frame and the backward feature of the i-th video frame to obtain the intermediate feature of the i-th video frame comprises:
    inputting the (i-1)-th video frame, the i-th video frame, the (i+1)-th video frame, the forward feature of the i-th video frame, the backward feature of the i-th video frame, the forward feature of the (i-1)-th video frame and the backward feature of the (i+1)-th video frame into a trained FFM model for fusion processing, obtaining the intermediate feature of the i-th video frame, wherein, if the (i-1)-th video frame is an anchor frame, the forward feature of the (i-1)-th video frame takes the value of the intermediate feature of the (i-1)-th video frame.
  4. The method according to claim 3, wherein the fusion processing comprises:
    acquiring fused features of the (i-1)-th video frame, the i-th video frame, and the (i+1)-th video frame;
    performing weight estimation on the fused features, the forward features of the i-th video frame, and the backward features of the i-th video frame to obtain a weight matrix;
    weighting the forward features of the i-th video frame and the backward features of the i-th video frame by using the weight matrix to obtain weighted features;
    performing a convolution calculation on the weighted features, the fused features, the forward features of the (i-1)-th video frame, and the backward features of the (i+1)-th video frame to obtain supplementary features;
    superimposing the supplementary features and the weighted features to obtain the intermediate features of the i-th video frame.
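Claims 3 and 4 together describe the FFM fusion. The sketch below follows the claimed sequence of operations (frame fusion, weight estimation, weighted blending, a convolution producing supplementary features, residual addition); all channel counts, layer choices, and the sigmoid-blend interpretation of the weight matrix are assumptions:

```python
import torch
import torch.nn as nn

class FFM(nn.Module):
    """Hypothetical fusion module tracing the steps of claim 4."""

    def __init__(self, feat_ch: int = 64, img_ch: int = 3):
        super().__init__()
        # Step 1: a joint feature from the three neighbouring frames.
        self.frame_fuse = nn.Sequential(
            nn.Conv2d(3 * img_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True)
        )
        # Step 2: a per-pixel blending weight for forward vs. backward features.
        self.weight_est = nn.Sequential(
            nn.Conv2d(3 * feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, 1, 3, padding=1), nn.Sigmoid(),
        )
        # Step 4: the convolution producing the supplementary features.
        self.supplement = nn.Conv2d(4 * feat_ch, feat_ch, 3, padding=1)

    def forward(self, frames_3, fwd_i, bwd_i, fwd_prev, bwd_next):
        fused = self.frame_fuse(torch.cat(frames_3, dim=1))
        w = self.weight_est(torch.cat([fused, fwd_i, bwd_i], dim=1))
        weighted = w * fwd_i + (1.0 - w) * bwd_i            # step 3: weighted features
        supp = self.supplement(
            torch.cat([weighted, fused, fwd_prev, bwd_next], dim=1)
        )
        return weighted + supp                              # step 5: superposition
```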
  5. The method according to claim 3, further comprising:
    constructing an initial video optimization model, the initial video optimization model comprising an initial feature extraction network, an initial optical flow network, an initial feature estimation network, and an initial FFM model;
    performing unsupervised training on the initial video optimization model by using a preset loss function and a training set to obtain the trained feature extraction network, optical flow network, feature estimation network, and FFM model;
    wherein the training set comprises a plurality of video frame sequence samples to be optimized.
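A skeleton of the training procedure of claim 5 might look as follows. The claim fixes only the model's four components and that training is unsupervised on unoptimized sequences; `preset_loss`, `train_loader`, and the component networks are placeholders, not part of the disclosure:

```python
import torch
import torch.nn as nn

class VideoOptimizationInitialModel(nn.Module):
    """Placeholder composite of the four components named in claim 5."""

    def __init__(self, feature_net, flow_net, estimation_net, ffm):
        super().__init__()
        self.feature_net, self.flow_net = feature_net, flow_net
        self.estimation_net, self.ffm = estimation_net, ffm

    def forward(self, seq):
        # Would run the anchor/propagation pipeline sketched after claim 1.
        raise NotImplementedError

model = VideoOptimizationInitialModel(feature_net, flow_net, estimation_net, ffm)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for seq in train_loader:             # each sample: one unoptimized frame sequence
    optimizer.zero_grad()
    output = model(seq)
    loss = preset_loss(output, seq)  # unsupervised: no ground-truth targets
    loss.backward()
    optimizer.step()
```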
  6. The method according to any one of claims 1-5, wherein the feature extraction network and the feature estimation network are obtained by splitting a preset image optimization model, the image optimization model being used for performing image optimization on two-dimensional images.
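Claim 6's split of a single pretrained image optimization model into the two sub-networks might look as follows; the toy architecture and the split index are purely illustrative:

```python
import torch.nn as nn

# Hypothetical stand-in for a pretrained image optimization model, e.g. a
# colorization CNN mapping one grayscale channel to two chrominance channels.
image_model = nn.Sequential(
    nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(),   # front layers -> features
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),  # back layers -> estimation
    nn.Conv2d(64, 2, 3, padding=1),
)
SPLIT_AT = 4  # assumed boundary between the two sub-networks
children = list(image_model.children())
feature_net = nn.Sequential(*children[:SPLIT_AT])     # feature extraction network
estimation_net = nn.Sequential(*children[SPLIT_AT:])  # feature estimation network
```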
  7. The method according to claim 6, wherein the image optimization model is an image colorization model, and the video frame sequence comprises N grayscale images;
    for the i-th grayscale image in the video frame sequence, where i takes values in {1, 2, ..., N-1, N}, performing feature estimation on the intermediate features of the i-th grayscale image by using the feature estimation network to obtain an optimized image of the i-th grayscale image comprises:
    performing color estimation on the intermediate features of the i-th grayscale image to obtain an a-channel image and a b-channel image corresponding to the i-th grayscale image;
    obtaining a color image of the i-th grayscale image in the Lab domain according to the i-th grayscale image, the a-channel image, and the b-channel image, the color image being the optimized image of the i-th grayscale image.
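The Lab-domain recomposition of claim 7 can be illustrated with scikit-image. The value ranges (L in [0, 100], a/b roughly in [-128, 127]) are the standard Lab conventions, assumed here:

```python
import numpy as np
from skimage import color

def compose_lab(gray_l: np.ndarray, a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Recompose a Lab image from the input luminance (used as the L channel)
    and the estimated a/b chrominance channels, then convert to RGB."""
    lab = np.stack([gray_l, a, b], axis=-1)   # (H, W, 3) image in the Lab domain
    return color.lab2rgb(lab)                 # RGB values in [0, 1]
```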
  8. A video optimization apparatus, comprising:
    an extraction unit, configured to separately extract intermediate features of M anchor frames in a video frame sequence to be optimized by using a trained feature extraction network, the video frame sequence comprising N video frames, the M anchor frames comprising the first video frame and the N-th video frame of the video frame sequence, M being a positive integer greater than 2 and less than N;
    a determining unit, configured to separately determine forward optical flow parameters and backward optical flow parameters of N-M intermediate frames by using a trained optical flow network, the forward optical flow parameters of an intermediate frame describing the transformation from the frame preceding the intermediate frame to the intermediate frame, the backward optical flow parameters of the intermediate frame describing the transformation from the frame following the intermediate frame to the intermediate frame, the intermediate frames being the video frames in the video to be optimized other than the anchor frames;
    the determining unit being further configured to determine the intermediate features of the N-M intermediate frames according to the forward optical flow parameters and backward optical flow parameters of the N-M intermediate frames and the intermediate features of the M anchor frames;
    an estimation unit, configured to separately perform feature estimation processing on the intermediate features of the N video frames of the video frame sequence by using a trained feature estimation network to obtain N optimized images, the N optimized images constituting an optimized video of the video frame sequence.
  9. A terminal device, comprising a memory and a processor, the memory being configured to store a computer program, and the processor being configured to execute the method according to any one of claims 1-6 when invoking the computer program.
  10. A computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the method according to any one of claims 1-6.
PCT/CN2021/137583 2021-05-21 2021-12-13 Video optimization method and apparatus, terminal device, and storage medium WO2022242122A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110557336.8 2021-05-21
CN202110557336.8A CN113298728B (en) 2021-05-21 2021-05-21 Video optimization method and device, terminal equipment and storage medium

Publications (1)

Publication Number Publication Date
WO2022242122A1 true WO2022242122A1 (en) 2022-11-24

Family

ID=77323598

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/137583 WO2022242122A1 (en) 2021-05-21 2021-12-13 Video optimization method and apparatus, terminal device, and storage medium

Country Status (2)

Country Link
CN (1) CN113298728B (en)
WO (1) WO2022242122A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117455798A (en) * 2023-11-17 2024-01-26 北京同力数矿科技有限公司 Lightweight video denoising method and system

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113298728B (en) * 2021-05-21 2023-01-24 中国科学院深圳先进技术研究院 Video optimization method and device, terminal equipment and storage medium
CN116823973B (en) * 2023-08-25 2023-11-21 湖南快乐阳光互动娱乐传媒有限公司 Black-white video coloring method, black-white video coloring device and computer readable medium


Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10176845B2 (en) * 2016-09-23 2019-01-08 Apple Inc. Seamless forward-reverse video loops
US10776688B2 (en) * 2017-11-06 2020-09-15 Nvidia Corporation Multi-frame video interpolation using optical flow

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190295228A1 (en) * 2018-03-21 2019-09-26 Nvidia Corporation Image in-painting for irregular holes using partial convolutions
CN109756690A (en) * 2018-12-21 2019-05-14 西北工业大学 Lightweight view interpolation method based on feature rank light stream
CN112104830A (en) * 2020-08-13 2020-12-18 北京迈格威科技有限公司 Video frame insertion method, model training method and corresponding device
CN112584077A (en) * 2020-12-11 2021-03-30 北京百度网讯科技有限公司 Video frame interpolation method and device and electronic equipment
CN113298728A (en) * 2021-05-21 2021-08-24 中国科学院深圳先进技术研究院 Video optimization method and device, terminal equipment and storage medium


Also Published As

Publication number Publication date
CN113298728A (en) 2021-08-24
CN113298728B (en) 2023-01-24


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
    Ref document number: 21940565
    Country of ref document: EP
    Kind code of ref document: A1
NENP Non-entry into the national phase
    Ref country code: DE
122 Ep: pct application non-entry in european phase
    Ref document number: 21940565
    Country of ref document: EP
    Kind code of ref document: A1