WO2022141819A1 - Video frame insertion method and apparatus, computer device, and storage medium - Google Patents
Video frame insertion method and apparatus, computer device, and storage medium
- Publication number
- WO2022141819A1, PCT/CN2021/081990 (CN2021081990W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- frame
- image
- frame image
- model
- reference frame
- Prior art date
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/20—Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
- H04N21/23—Processing of content or additional data; Elementary server operations; Server middleware
- H04N21/234—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
- H04N21/2343—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/234381—Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements by altering the temporal resolution, e.g. decreasing the frame rate by frame skipping
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/40—Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
- H04N21/43—Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
- H04N21/44—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
- H04N21/4402—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/440281—Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display by altering the temporal resolution, e.g. by frame skipping
Definitions
- Embodiments of the present invention relate to the field of video processing, and in particular, to a video frame insertion method, apparatus, computer equipment, and storage medium.
- Frame rate is the frequency (rate) at which bitmap images, in units of frames, appear continuously on a display.
- The frame rate directly affects the smoothness of video playback: a video with a higher frame rate plays more fluently, while a video with a lower frame rate plays less fluently. When the frame rate is too low, video playback will stutter.
- The inventor of the present invention found through research that, during network live video streaming, users' network conditions are many and varied; under unfavorable conditions the bit rate of the transmitted video must be reduced, and the methods of reducing the bit rate include lowering the video resolution or lowering the video frame rate.
- In practice the video is therefore often subjected to frame-rate reduction, and reducing the frame rate reduces the smoothness of video stream playback and harms the user's viewing experience.
- Embodiments of the present invention provide a video frame insertion method, device, computer equipment, and storage medium capable of improving video playback fluency.
- a technical solution adopted by the embodiment of the present invention is to provide a video frame insertion method, including:
- the first reference frame image and the second reference frame image are combined and input into a preset frame insertion model, wherein the frame insertion model is a neural network model that is pre-trained to convergence and used to perform frame insertion processing on the target frame image according to the motion vector between the first reference frame image and the second reference frame image;
- the up-frame image output by the frame insertion model is read, and the up-frame image is inserted between the first reference frame image and the second reference frame image.
- the image standard is a frame rate threshold
- the acquiring the target video to be processed includes:
- when the frame rate value represented by the frame rate data is smaller than the frame rate threshold, the video to be played is determined to be the target video.
- the frame insertion model includes a motion vector network model
- the merging and inputting the first reference frame image and the second reference frame image into a preset frame insertion model includes:
- the channel images of the first reference frame image and the second reference frame image are superimposed to generate a superimposed image, and the superimposed image is input into the motion vector network model, wherein the motion vector network model is a convolutional neural network model that is pre-trained to convergence and used to extract motion vectors between images;
- the motion vector network model performs convolution pooling processing on the superimposed image through a convolution layer to generate down-sampling features
- the motion vector network model performs interpolation processing on the down-sampling features through a deconvolution layer to generate up-sampling features
- the motion vector is generated by performing feature fusion and superposition on the down-sampling feature and the up-sampling feature.
- the frame insertion model includes a frame synthesis network model
- the merging and inputting the first reference frame image and the second reference frame image into the preset frame insertion model includes:
- the motion vector, the intermediate frame image, the first reference frame image and the second reference frame image are combined and input into the frame synthesis network model, wherein the frame synthesis network model is a convolutional neural network model that is pre-trained to a convergent state and used for image interpolation;
- the frame synthesis network model performs convolution processing on the motion vector, the intermediate frame image, the first reference frame image and the second reference frame image to generate a visible mask image;
- the frame synthesis network model performs interpolation processing on the visible mask map and the motion vector to generate the up-frame image.
- the training method of the frame insertion model includes:
- the sample atlas includes a first training frame image, a second training frame image and a sample frame image, and the sample frame image is located within the time interval represented by the first training frame image and the second training frame image;
- the first training frame image, the second training frame image, the training motion vector and the training intermediate frame image are input into a preset second initial model, wherein the second initial model is a convolutional neural network model that has not yet been trained to a convergent state and is used to interpolate images;
- the weight values in the first initial model and the second initial model are iteratively updated based on the feature difference until the feature difference is less than or equal to the loss threshold.
- the iterative updating of the weight values in the first initial model and the second initial model based on the feature difference, until the feature difference is less than or equal to the loss threshold, includes:
- the first initial model trained to a convergent state is the motion vector network model
- the second initial model is the frame synthesis network model
- the frame insertion model includes a loss function
- the loss function is a weighted combination of a reconstruction disparity function and a motion vector estimation restoration disparity function.
- an embodiment of the present invention also provides a video frame insertion device, including:
- the acquisition module is used to acquire the target video to be processed
- an extraction module configured to extract a first reference frame image and a second reference frame image in the target video, wherein the first reference frame image and the second reference frame image are adjacent on the time axis;
- a processing module configured to combine the first reference frame image and the second reference frame image and input them into a preset frame insertion model, wherein the frame insertion model is a neural network model that is pre-trained to convergence and used to perform frame insertion processing on the target frame image according to the motion vector between the first reference frame image and the second reference frame image;
- the reading module is configured to read the frame-up image output by the frame insertion model, and insert the frame-up image between the first reference frame image and the second reference frame image.
- the image standard is a frame rate threshold
- the video frame insertion device further includes:
- the first acquisition submodule is used to acquire the frame rate data of the video to be played
- a first comparison submodule for comparing the frame rate data with the frame rate threshold
- the first execution sub-module is configured to determine that the video to be played is the target video when the frame rate value represented by the frame rate data is smaller than the frame rate threshold.
- the frame insertion model includes a motion vector network model
- the video frame insertion device further includes:
- the first input sub-module is used for superimposing the channel images of the first reference frame image and the second reference frame image to generate a superimposed image, and for inputting the superimposed image into the motion vector network model, wherein the motion vector network model is a convolutional neural network model that is pre-trained to convergence and used to extract motion vectors between images;
- the first generation submodule is used for the motion vector network model to perform convolution pooling processing on the superimposed image through the convolution layer to generate down-sampling features;
- the second generation sub-module is used for the motion vector network model to perform interpolation processing on the down-sampling features through the deconvolution layer to generate up-sampling features;
- the first stacking submodule is configured to perform feature fusion and stacking on the down-sampling feature and the up-sampling feature to generate the motion vector.
- the frame insertion model includes a frame synthesis network model
- the video frame insertion device further includes:
- the third generation sub-module is used to perform interpolation processing on the motion vector to generate an intermediate frame image
- the second input sub-module is configured to combine the motion vector, the intermediate frame image, the first reference frame image and the second reference frame image and input them into the frame synthesis network model, wherein the frame synthesis network model is a convolutional neural network model that is pre-trained to a convergent state and used for image interpolation;
- the fourth generation sub-module is used for the frame synthesis network model to perform convolution processing on the motion vector, the intermediate frame image, the first reference frame image and the second reference frame image to generate a visible mask image;
- the fifth generation sub-module is used for the frame synthesis network model to perform interpolation processing on the visible mask map and the motion vector to generate the up-frame image.
- the video frame insertion device further includes:
- the first processing submodule is used for framing the pre-collected sample video to generate a sample atlas, wherein the sample atlas includes a first training frame image, a second training frame image and a sample frame image, and the sample frame image is located within the time interval represented by the first training frame image and the second training frame image;
- the third input sub-module is used to input the first training frame image and the second training frame image into a preset first initial model, wherein the first initial model is a convolutional neural network model that has not been trained to a convergent state and is used to extract motion vectors between images;
- the first reading submodule is used to read the training motion vector output by the first initial model, and to generate a training intermediate frame image by interpolation according to the training motion vector;
- the fourth input sub-module is used to input the first training frame image, the second training frame image, the training motion vector and the training intermediate frame image into a preset second initial model, wherein the second initial model is a convolutional neural network model that has not been trained to a convergent state and is used to interpolate images;
- the second reading sub-module is used to read the training up-frame image output by the second initial model, and to calculate the feature difference between the training up-frame image and the sample frame image according to a preset loss function;
- the second execution sub-module is configured to iteratively update the weight values in the first initial model and the second initial model based on the feature difference when the feature difference is greater than a preset loss threshold, until the feature difference is less than or equal to the loss threshold.
- the video frame insertion device further includes:
- the second processing sub-module is used to repeatedly and iteratively supervise the training of the frame insertion model through a plurality of the sample atlases, until the frame insertion model meets the preset convergence conditions;
- the third execution sub-module is configured to determine that the first initial model trained to a convergent state is the motion vector network model, and the second initial model is the frame synthesis network model.
- the frame insertion model includes a loss function
- the loss function is a weighted combination of a reconstruction disparity function and a motion vector estimation restoration disparity function.
- an embodiment of the present invention further provides a computer device, including a memory and a processor, where computer-readable instructions are stored in the memory, and when the computer-readable instructions are executed by the processor, the processor executes the steps of the video frame insertion method described above.
- an embodiment of the present invention further provides a storage medium storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors execute the steps of the above-mentioned video frame insertion method.
- the beneficial effects of the embodiments of the present invention are: when a target video that requires frame insertion processing is determined, two adjacent frame images in the target video are read as reference frame images, and the motion vector between the two frame images is extracted from the two reference frame images. Since the motion vector can represent the transitional motion state between the two reference frame images, an up-frame image between the two reference frame images can be generated by the frame insertion model from the motion vector and the two reference frame images. The introduction of the motion vector allows the up-frame image to display the intermediate state between the two reference frame images, making the frame insertion result more natural and greatly improving the user experience.
- FIG. 1 is a schematic flow chart of a basic flow of a video frame insertion method according to a specific embodiment of the present application
- FIG. 2 is a schematic flowchart of a specific embodiment of the present application for screening target videos
- FIG. 3 is a schematic flowchart of a motion vector extraction according to a specific embodiment of the present application.
- FIG. 4 is a schematic flowchart of a second implementation manner of generating an up-frame image according to a specific embodiment of the present application
- FIG. 5 is a schematic flow chart of a single process of training a frame insertion model according to a specific embodiment of the application
- FIG. 6 is a schematic flowchart of the whole process of training a frame insertion model according to a specific embodiment of the application
- FIG. 7 is a schematic diagram of the basic structure of a video frame insertion device according to an embodiment of the present application.
- FIG. 8 is a basic structural block diagram of a computer device according to an embodiment of the present application.
- The "terminal" used here includes both a device with only a wireless signal receiver and no transmission capability, and a device with receiving and transmitting hardware capable of two-way communication over a two-way communication link.
- Such equipment may include: cellular or other communication equipment with a single-line display, a multi-line display, or no multi-line display; PCS (Personal Communications Service) devices, which may combine voice, data processing, fax and/or data communication capabilities; PDAs (Personal Digital Assistant), which may include a radio frequency receiver, pager, Internet/Intranet access, web browser, notepad, calendar and/or GPS (Global Positioning System) receiver; and conventional laptop and/or palmtop computers or other devices that have and/or include a radio frequency receiver.
- A terminal as used herein may be portable, transportable, mounted in a vehicle (air, marine and/or land), or adapted and/or configured to operate locally and/or in a distributed fashion, at any location on Earth and/or in space.
- The "terminal" used here can also be a communication terminal, an Internet terminal, or a music/video playback terminal, such as a PDA, a MID (Mobile Internet Device) and/or a mobile phone with a music/video playback function; it can also be a smart TV, a set-top box, or other such devices.
- FIG. 1 is a schematic diagram of a basic flow of a video frame insertion method according to this embodiment.
- the video frame insertion method includes:
- the target video in this embodiment refers to the video to be processed that is selected for frame insertion processing to increase the video frame rate.
- the target video can be a network video sent to the terminal through the server, or a local video stored locally in the terminal.
- the video frame insertion method in this implementation can also be used to process video data uploaded by the terminal.
- the target video is the video uploaded by the terminal.
- the target video needs to be screened during acquisition, and the screening methods mainly include screening by bit rate or by frame rate.
- when the target video is a video transmitted over the network, after receiving the video data sent by the server, the terminal reads the bit rate of the video data at the network port, and when the bit rate is lower than the preset bit rate threshold, determines that the video data is the target video.
- when the video is a local video, the terminal reads the frame rate parameter of the video, and when the value represented by the frame rate parameter is less than the frame rate threshold, determines that the video data is the target video.
- when the video frame insertion method is used to process the video data uploaded by the terminal, the server reads the bit rate of the data uploaded by the terminal, and when the bit rate is lower than a preset bit rate threshold, determines that the uploaded video data is the target video.
- when the target video is determined, two adjacent frame images in the target video are extracted, and the two frame images are defined as the first reference frame image and the second reference frame image.
- the first reference frame image and the second reference frame image are adjacent on the time axis.
- the acquisition of the first reference frame image and the second reference frame image can be performed by random extraction.
- specifically, the target video is framed to convert it into multiple frame images arranged along the time axis; a frame image is then selected from the multiple frame images as the first reference frame image by a random algorithm, and an adjacent frame image before or after the first reference frame image is selected as the second reference frame image.
- the selection of the first reference frame image and the second reference frame image needs to consider the requirement of scene transition.
- otherwise, the transition process will appear abrupt and unnatural.
- the collected adjacent frame images are input into the transition classification model.
- the transition classification model is trained with supervision so that it can determine whether the two images belong to a scene transition.
- the transition classification model can be trained from a convolutional neural network model, a deep convolutional neural network model, a recurrent neural network model, or a variant of the above models. When the first reference frame image and the second reference frame image are determined to be transition images, performing frame interpolation processing on the transition images can further improve the smoothness of video playback.
- after the first reference frame image and the second reference frame image are acquired, the first reference frame image and the second reference frame image are combined and input into the frame insertion model.
- the frame insertion model is used to perform frame insertion processing on the target video according to the motion vector between the two input images; because the frame insertion model is pre-trained to a convergent state, frame insertion can be performed on the target video accurately.
- the pixels of the first reference frame image and the second reference frame image are superimposed.
- first, the image sizes of the first reference frame image and the second reference frame image are adjusted to be consistent, and each reference frame image is split into its three RGB color channels, namely the red, green and blue channels; then, taking the channel color as the category, the channel images in the same category are weighted and superimposed.
- the superimposed channel images are then merged to generate the superimposed image.
- the merged superimposed image is input into the frame insertion model. Since the frame insertion model is trained to extract the motion vector between the first reference frame image and the second reference frame image, after the superimposed image passes through the feature extraction of the model's convolution layers, the motion vector between the first reference frame image and the second reference frame image is obtained; this motion vector represents the state of change between the two reference frame images. Based on the numerical values represented by the motion vector, the frame insertion model pixelizes the motion vector and thereby generates an up-frame image.
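The channel-wise superposition described above can be pictured as stacking the two RGB reference frames into a single multi-channel tensor. The Python/PyTorch snippet below is a minimal sketch under that reading; the function name and the use of simple concatenation (rather than a learned per-channel weighting) are illustrative assumptions, not the patented implementation.

```python
import torch

def superimpose(frame_a: torch.Tensor, frame_b: torch.Tensor) -> torch.Tensor:
    """Stack the red/green/blue channels of two aligned reference frames, each of
    shape (3, H, W) and normalized to [0, 1], into one 6-channel superimposed image."""
    assert frame_a.shape == frame_b.shape, "reference frames must share a resolution"
    return torch.cat([frame_a, frame_b], dim=0)  # shape (6, H, W)

overlay = superimpose(torch.rand(3, 256, 256), torch.rand(3, 256, 256))
print(overlay.shape)  # torch.Size([6, 256, 256])
```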
- the frame insertion model is a joint model consisting of a motion vector network model and a frame synthesis network model, wherein the motion vector network model is a convolutional neural network model that is pre-trained to convergence and used to extract motion vectors between images, and the frame synthesis network model is a convolutional neural network model that is pre-trained to a convergent state and used to interpolate images.
- after the motion vector network model extracts the motion vector, the motion vector, the first reference frame image, the second reference frame image and the relatively rough intermediate frame image generated from the motion vector are used as input parameters, and feature extraction continues to form a visible mask image. Finally, a more refined up-frame image is generated according to the visible mask image and each motion vector.
- S1400 Read the frame-up image output by the frame insertion model, and insert the frame-up image between the first reference frame image and the second reference frame image.
- after the up-frame image is output by the frame insertion model, the generated up-frame image is read and inserted between the first reference frame image and the second reference frame image to complete one frame-up step. The process of S1100-S1400 is then repeated until the bit rate or frame rate of the target video reaches the set bit rate threshold or frame rate threshold, at which point the frame interpolation operation on the target video ends.
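A minimal sketch of this repeat-until-threshold loop is shown below. The function and variable names are hypothetical placeholders, and the assumption that one full pass over all adjacent pairs doubles the frame rate is an illustration rather than a requirement of the method.

```python
def upsample_frame_rate(frames, fps, fps_threshold, interp_model):
    """Repeat the S1100-S1400 cycle: interpolate an up-frame between every pair of
    adjacent reference frames until the frame rate reaches the threshold."""
    while fps < fps_threshold:
        doubled = []
        for first_ref, second_ref in zip(frames[:-1], frames[1:]):
            doubled.append(first_ref)
            doubled.append(interp_model(first_ref, second_ref))  # the up-frame image
        doubled.append(frames[-1])
        frames, fps = doubled, fps * 2  # each full pass doubles the frame rate
    return frames, fps
```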
- two adjacent frame images in the target video are read as reference frame images, and the motion vector between the two frame images is extracted from the two reference frame images. Since the motion vector can represent the transitional motion state between the two reference frame images, an up-frame image between the two reference frame images can be generated through the motion vector and the two reference frame images by the frame insertion model.
- the introduction of the motion vector can make the image of the up-frame image display the intermediate state between the two reference frame images, making the frame insertion result more natural, and greatly improving the user experience.
- the determination of the target video requires screening by the frame rate of the video. Please refer to FIG. 2.
- FIG. 2 is a schematic flowchart of screening a target video according to this embodiment.
- S1100 includes:
- when the user terminal plays the video to be played in response to an instruction, it reads the frame rate data of the video to be played.
- the video to be played in this embodiment includes the network video sent by the server and the local video stored in the local storage space of the user terminal.
- the frame rate threshold can be set according to the minimum standard for the video playback frame rate, or according to the original frame rate of the video to be played. For example, when the server sends video data to the user terminal, the frame rate data of the server-side video data is sent to the user terminal, and after the user terminal receives the frame rate data sent by the server, that frame rate data is set as the frame rate threshold.
- when the frame rate value represented by the frame rate data is smaller than the frame rate threshold, it is determined that the video to be played is the target video on which the frame insertion operation needs to be performed.
- when the frame rate value represented by the frame rate data is greater than or equal to the frame rate threshold, it is determined that frame interpolation processing is not required for the video to be played.
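The screening step amounts to comparing a video's frame rate against the threshold. A small illustrative sketch follows; OpenCV is used here only as one convenient way to read the frame rate from the container, and is not prescribed by the patent.

```python
import cv2  # pip install opencv-python

def is_target_video(video_path: str, frame_rate_threshold: float) -> bool:
    """Return True when the video's frame rate is below the threshold,
    i.e. the video should be selected as a target video for frame insertion."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    capture.release()
    return fps < frame_rate_threshold
```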
- in some embodiments, the video segment in the time period where playback freezes is intercepted as the target video, and frame insertion processing is performed on the target video, thereby eliminating the video freeze phenomenon.
- the frame insertion model includes a motion vector network model for extracting motion vectors of the first reference frame image and the second reference frame image.
- FIG. 3 is a schematic flowchart of extracting motion vectors according to this embodiment.
- S1300 includes:
- the superimposed images are input into the motion vector network model, which is a convolutional neural network model that is pre-trained to convergence and used to extract motion vectors between images.
- the model adopted by the motion vector network model is a U-net network model.
- the U-net network structure includes two symmetrical parts: the first part of the network is the same as an ordinary convolutional network, using 3x3 convolutions and pooling for downsampling, which captures the context information in the image; the latter part of the network is symmetrical with the first part, using 3x3 deconvolution layers and upsampling to produce the segmentation output.
- feature fusion is also used in the network, and the features of the previous part of the downsampling network are fused with the features of the latter part of the upsampling part to obtain more accurate context information and achieve a better segmentation effect.
- the motion vector network model can also be a U2-net network model.
- the model adopted by the motion vector network model can also be (but not limited to): a convolutional neural network model, a deep convolutional neural network model, a recurrent neural network model or a variant of the above-mentioned neural network model.
- the motion vector network model performs a convolution pooling process on the superimposed image through a convolution layer to generate down-sampling features
- after the superimposed image is input into the motion vector network model, the convolution layers in the motion vector network model perform convolution and pooling processing on the superimposed image and extract the down-sampling features of the superimposed image. In this down-sampling process, the motion vector network model performs feature extraction and image reduction on the superimposed image.
- the motion vector network model performs interpolation processing on the down-sampling features through a deconvolution layer to generate up-sampling features
- after the feature extraction and reduction of the superimposed image through the convolution layers, the motion vector network model performs interpolation processing on the reduced image through the deconvolution layers that are symmetrical with the convolution layers to generate up-sampling features; this processing is up-sampling. During up-sampling, image features are extracted by interpolation and the reduced superimposed image is enlarged.
- after the motion vector network model processes the superimposed image by convolution and deconvolution, the down-sampling features and up-sampling features of the superimposed image are obtained, and the down-sampling features and up-sampling features are then fused and superimposed.
- specifically, the deconvolved image is weighted with the corresponding down-sampling features to obtain the fused motion vector.
- the motion vector network model includes: a first convolution layer, a second convolution layer, a third convolution layer, a first deconvolution layer, a second deconvolution layer, and a third deconvolution layer.
- the first convolution layer and the first deconvolution layer are symmetrical to each other
- the second convolution layer and the second deconvolution layer are symmetrical to each other
- the third convolution layer and the third deconvolution layer are symmetrical to each other.
- the first convolution layer performs feature extraction on the superimposed image, and the extracted features are synchronized to the second convolution layer and the first deconvolution layer.
- the second convolution layer then performs feature extraction, and the extracted features are synchronized to the third convolution layer and the second deconvolution layer; and so on, the superimposed image passes through a "U"-shaped feature-extraction path of convolution layers, and the motion vector is finally output by the third deconvolution layer.
- in the process of feature extraction by the first deconvolution layer, the second deconvolution layer and the third deconvolution layer, each layer receives not only the features passed on by the preceding layer but also the features synchronized from its corresponding convolution layer; therefore, the features of the down-sampling network are fused with the features of the subsequent up-sampling part to obtain more accurate contextual information.
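The symmetric three-stage encoder/decoder with skip connections described above can be sketched in PyTorch as follows. Channel widths, kernel sizes, and the choice of four output channels (two bidirectional flow fields) are illustrative assumptions, not the patented architecture.

```python
import torch
import torch.nn as nn

class MotionVectorNet(nn.Module):
    """U-Net-style sketch: three convolution (down-sampling) stages mirrored by three
    deconvolution (up-sampling) stages, with skip connections fusing matching features."""
    def __init__(self, in_channels: int = 6, vector_channels: int = 4):
        super().__init__()
        def down(cin, cout):  # 3x3 convolution followed by pooling
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2))
        def up(cin, cout):    # deconvolution doubling the spatial resolution
            return nn.Sequential(nn.ConvTranspose2d(cin, cout, 2, stride=2), nn.ReLU())
        self.conv1, self.conv2, self.conv3 = down(in_channels, 32), down(32, 64), down(64, 128)
        self.deconv3, self.deconv2 = up(128, 64), up(64 + 64, 32)
        self.deconv1 = nn.ConvTranspose2d(32 + 32, vector_channels, 2, stride=2)

    def forward(self, overlay: torch.Tensor) -> torch.Tensor:
        d1 = self.conv1(overlay)                            # first convolution layer
        d2 = self.conv2(d1)                                 # second convolution layer
        d3 = self.conv3(d2)                                 # third convolution layer
        u3 = self.deconv3(d3)                               # third deconvolution counterpart
        u2 = self.deconv2(torch.cat([u3, d2], dim=1))       # fuse with matching down-features
        return self.deconv1(torch.cat([u2, d1], dim=1))     # motion vector map

flow = MotionVectorNet()(torch.rand(1, 6, 256, 256))
print(flow.shape)  # torch.Size([1, 4, 256, 256])
```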
- after the motion vector network model obtains the motion vector between the first reference frame image and the second reference frame image, the vector values in the motion vector are pixelized, and the up-frame image between the first reference frame image and the second reference frame image is generated.
- FIG. 4 is a schematic flowchart of a second implementation manner of generating an up-frame image in this embodiment.
- the frame insertion model is a joint model, consisting of a motion vector network model and a frame synthesis network model, wherein the motion vector network model is a convolutional neural network model that is pre-trained to convergence and used to extract motion vectors between images , the frame synthesis network model is a convolutional neural network model that is pre-trained to a convergent state and used to interpolate images.
- the output of the motion vector network model is connected to an input channel of the frame synthesis network model.
- the vector values in the motion vector are pixelized to generate a relatively rough intermediate frame image; this intermediate frame image could itself be used as the up-frame image between the first reference frame image and the second reference frame image.
- the motion vector, the intermediate frame image, the first reference frame image and the second reference frame image are combined in the following way: the pixel values of corresponding points in the four images of the same size are weighted to generate new pixel values, and the new pixel values compose the merged image.
- the manner of image merging input is not limited to this.
- for example, the motion vector, the intermediate frame image, the first reference frame image and the second reference frame image can also be spliced (concatenated) together before being input.
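One common way to turn a motion vector field into the rough intermediate frame mentioned above is backward warping of the two reference frames. The sketch below uses PyTorch's grid_sample; the halfway scaling of the flow fields and the equal-weight averaging are simplifying assumptions for the temporal midpoint, not details taken from the patent.

```python
import torch
import torch.nn.functional as F

def backward_warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """Sample `frame` (N, C, H, W) at positions displaced by `flow` (N, 2, H, W), in pixels."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0)   # (1, 2, H, W) pixel grid
    coords = base + flow                                       # displaced sampling positions
    grid_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                # normalize to [-1, 1]
    grid_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((grid_x, grid_y), dim=-1)               # (N, H, W, 2)
    return F.grid_sample(frame, grid, align_corners=True)

def rough_intermediate(i0, i1, flow_0_to_1, flow_1_to_0):
    """Rough midpoint frame: average of both references warped halfway along the flow."""
    return 0.5 * backward_warp(i0, 0.5 * flow_1_to_0) + 0.5 * backward_warp(i1, 0.5 * flow_0_to_1)
```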
- the frame synthesis network model can be (but is not limited to): a convolutional neural network model, a deep convolutional neural network model, a recurrent neural network model, or a variant model of the above-mentioned neural network models.
- the frame synthesis network model performs convolution processing on the motion vector, the intermediate frame image, the first reference frame image and the second reference frame image to generate a visible mask image;
- the frame synthesis network model performs convolution processing on the motion vector, the intermediate frame image, the first reference frame image and the second reference frame image to generate a visible mask image.
- the visible mask image consists of alpha (α channel) values in the range 0-1: 0 indicates that the current position of the generated frame reuses the value at the current position of the first reference frame image, 1 indicates that it reuses the value at the current position of the second reference frame image, and intermediate values indicate a fusion of the content of the two frames.
- the frame synthesis network model performs interpolation processing on the visible mask map and the motion vector to generate the up-frame image.
- the frame synthesis network model performs interpolation processing on the visible mask map and the motion vector.
- interpolation processing refers to predicting the value of a given pixel point according to the information of its surrounding pixels.
- the technical solutions adopted in the interpolation processing include (not limited to): nearest neighbor method, linear interpolation method, bilinear interpolation method or bicubic interpolation method, etc.
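Read literally, the visible mask acts as a per-pixel blending weight between content drawn from the two reference frames. A minimal sketch of that blending step is given below; the variable names and the assumption that the two inputs are candidates already warped from the first and second reference frames are illustrative, not taken from the patent.

```python
import torch

def synthesize_up_frame(mask: torch.Tensor,
                        from_first_ref: torch.Tensor,
                        from_second_ref: torch.Tensor) -> torch.Tensor:
    """Blend two candidate images with the visible mask (alpha in [0, 1]):
    0 reuses the first reference frame's content at that pixel, 1 reuses the
    second reference frame's content, and intermediate values fuse both."""
    mask = mask.clamp(0.0, 1.0)
    return (1.0 - mask) * from_first_ref + mask * from_second_ref
```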
- FIG. 5 is a schematic flowchart of a single process of training a frame insertion model according to this embodiment.
- the training method of the frame insertion model is as follows:
- the sample atlas includes a first training frame image, a second training frame image and a sample frame image, and the sample frame image is located within the time interval represented by the first training frame image and the second training frame image;
- sample videos for model training are collected, and the sample videos are framed.
- the framed sequence of frame images is packaged into sets of five consecutive frames, and each packaged set is called a sample atlas.
- the composition of the sample atlas is not limited to this; according to different application scenarios, in some embodiments 3, 4, 6 or more consecutive frame images in the frame sequence are packaged into a sample atlas.
- the sample atlas includes a first training frame image, a second training frame image and a sample frame image, where the sample frame image is located within the time interval represented by the first training frame image and the second training frame image. Specifically, the first and last frame images in the sample atlas are selected as the first training frame image and the second training frame image, and one frame image is randomly selected from the remaining frame images as the sample frame image.
- for example, the original frames of the sample video are extracted and stored in playback sequence order, the extracted images are scaled to a resolution of 256 pixels wide by 256 pixels high, and finally these sequence images are packaged into groups of 5 frames (Frame0, Frame1, Frame2, Frame3, Frame4). Any one of the middle frames (Frame1, Frame2, Frame3) can be selected as the sample frame image, and Frame0 and Frame4 are used as the first training frame image and the second training frame image, respectively.
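The packaging scheme just described can be summarized in a few lines of Python. This is a minimal sketch assuming frames are already decoded, resized, and ordered; the function and dictionary key names are hypothetical.

```python
import random

FRAMES_PER_ATLAS = 5  # Frame0 .. Frame4, as in the example above

def build_sample_atlases(sequence_frames):
    """Group consecutive frames into atlases of five; Frame0/Frame4 become the first
    and second training frame images, and one of Frame1-Frame3 is drawn at random
    as the sample (ground-truth) frame image."""
    atlases = []
    for start in range(0, len(sequence_frames) - FRAMES_PER_ATLAS + 1, FRAMES_PER_ATLAS):
        group = sequence_frames[start:start + FRAMES_PER_ATLAS]
        atlases.append({
            "first_training_frame": group[0],
            "second_training_frame": group[-1],
            "sample_frame": random.choice(group[1:-1]),
        })
    return atlases
```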
- in some embodiments, image enhancement processing needs to be performed on the first training frame image and the second training frame image; the enhancement methods include (but are not limited to) operations such as random cropping, random rotation, and adding random noise.
- the first training frame image and the second training frame image are superimposed and input into the first initial model.
- image superposition refers to weighting the corresponding pixel points of the first training frame image and the second training frame image.
- the merged first training frame image and second training frame image are input into the first initial model.
- the first initial model is the unconverged state of the motion vector network model, which is also a convolutional neural network model for extracting motion vectors between images.
- the first initial model can be (but is not limited to): a U-net network model, a U2-net network model, a convolutional neural network model, a deep convolutional neural network model, a recurrent neural network model, or a variant of the above-mentioned neural network model.
- because the first initial model has not yet been trained to a convergent state, the training motion vector it outputs has strong randomness and poor accuracy; however, as training progresses, the accuracy of the output training motion vector becomes higher and higher.
- each vector value represented by the training motion vector output by the first initial model is pixelized to generate a training intermediate frame image.
- after the training motion vector and the training intermediate frame image are obtained through the first initial model, the first training frame image, the second training frame image, the training motion vector and the training intermediate frame image are input into the second initial model.
- the second initial model is an unconverged state model of the frame synthesis network model, which also belongs to the convolutional neural network model used for image interpolation.
- the second initial model includes (but is not limited to): a convolutional neural network model, a deep convolutional neural network model, a recurrent neural network model, or a variant model of the above-mentioned neural network model.
- the first training frame image, the second training frame image, the training motion vector and the training intermediate frame image are combined by weighting the corresponding pixel values of the four images of the same size to generate new pixel values, and the new pixel values compose the merged image.
- the second initial model generates a training visible mask image by convolving the merged image, and then performs interpolation processing on the training visible mask image and the training motion vector.
- the technical solutions used in the interpolation processing include (not limited to): nearest neighbor method, linear interpolation, bilinear interpolation or bicubic interpolation, etc.
- the image generated after interpolation processing is the training up-frame image.
- when reading the training up-frame image output by the second initial model, it should be pointed out that, since the second initial model has not been trained to a convergent state, the output up-frame image has strong randomness and poor accuracy; however, as training progresses and the second initial model gradually converges, the accuracy of the output training up-frame image becomes higher and higher.
- the sample frame image is directly used as the labeling image, which eliminates the process of labeling the image in the supervised training process, simplifies the training process of the frame insertion model, and improves the training efficiency.
- the loss function is a composite loss function; specifically, the loss function is a weighted combination of a reconstruction disparity function and a motion vector estimation restoration disparity function.
- the terms of the loss function are specifically described as follows:
- l_r represents the reconstruction difference between the sample frame image and the training up-frame image;
- l_w represents the motion vector estimation restoration difference between the sample frame image and the training up-frame image;
- λ and γ (the two weighting coefficients) are parameter values;
- N represents the batch size;
- I_ti represents the sample frame image;
- I_0 represents the first training frame image;
- I_1 represents the second training frame image;
- F_0→1 represents the motion vector from I_0 to I_1, and F_1→0 represents the motion vector from I_1 to I_0;
- g represents the backward restoration (sampling) function, which can restore the content of the next frame from the motion vector and the previous frame.
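The formula itself is not reproduced in the text above, so the LaTeX block below is only a plausible reconstruction assembled from the symbol definitions just listed; the exact form, the choice of L1 norms, and the parameter symbols λ and γ are assumptions (one common form consistent with the stated role of g), not the patented expression.

```latex
% Reconstructed sketch of the composite loss, under the assumptions stated above.
\mathcal{L} \;=\; \lambda\, l_r \;+\; \gamma\, l_w, \qquad
l_r \;=\; \frac{1}{N}\sum_{i=1}^{N} \bigl\lVert \hat{I}_{t_i} - I_{t_i} \bigr\rVert_1,
\qquad
l_w \;=\; \frac{1}{N}\sum_{i=1}^{N} \Bigl(
      \bigl\lVert I_0 - g\!\left(I_1,\, F_{0\to 1}\right) \bigr\rVert_1
    + \bigl\lVert I_1 - g\!\left(I_0,\, F_{1\to 0}\right) \bigr\rVert_1 \Bigr)
```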
- the loss function is not limited to the loss function types exemplified above. According to different specific application scenarios, the loss function includes (but is not limited to): an absolute value loss function, a logarithmic loss function, a squared loss function, an exponential loss function, a hinge loss function, a perceptual loss function, a cross-entropy loss function, or a composite function composed of two or more of these functions.
- when the feature difference between the sample frame image and the training up-frame image calculated by the loss function is greater than the preset loss threshold, the weight values in the first initial model and the second initial model are corrected based on the feature difference through back-propagation, so that the feature difference between the training up-frame image jointly output by the corrected first and second initial models and the sample frame image tends to become less than or equal to the loss threshold.
- the process of S2112-S2116 is repeatedly executed until the feature difference between the sample frame image and the training up-frame image calculated by the loss function is less than or equal to the loss threshold, at which point training on this sample atlas is completed.
- FIG. 6 is a schematic flowchart of the entire process of training the frame insertion model according to the present embodiment.
- S2116 includes:
- training the frame insertion model requires a large number of sample atlases. Specifically, different sample atlases are used to repeatedly and iteratively execute the process of S2111-S2116, and each round of training corrects the weight parameters in the frame insertion model so that the training up-frame image output by the frame insertion model becomes closer and closer to the sample frame image.
- in some embodiments, the convergence condition is: after about 2,000,000 training iterations, when the accuracy of the model output on the test samples reaches 95% or higher, the frame insertion model is deemed to meet the convergence condition.
- the setting of the convergence condition is not limited to this. According to different specific application scenarios, in some embodiments, the number of times of iterative training and the setting of the accuracy rate can be set according to actual needs.
- at this point the first initial model and the second initial model are also in a convergent state; the first initial model is defined as the motion vector network model, and the second initial model as the frame synthesis network model.
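The overall supervised training loop described in this section might be organized as in the sketch below. The model, loss, and optimizer objects, the data layout, and the early skip on small losses are hypothetical placeholders standing in for the first initial model (motion vectors) and the second initial model (frame synthesis), not the patented training procedure.

```python
import torch

def train_frame_insertion(atlases, flow_model, synthesis_model, loss_fn, optimizer,
                          loss_threshold=1e-3, max_iterations=2_000_000):
    """Iterate over many sample atlases, updating both initial models until the
    feature difference stays at or below the loss threshold (or iterations run out)."""
    for iteration in range(max_iterations):
        atlas = atlases[iteration % len(atlases)]
        i0, i1, target = (atlas["first_training_frame"],
                          atlas["second_training_frame"],
                          atlas["sample_frame"])
        flow = flow_model(torch.cat([i0, i1], dim=1))   # training motion vector from the overlay
        up_frame = synthesis_model(flow, i0, i1)        # training up-frame image
        loss = loss_fn(up_frame, target)                # feature difference vs. the sample frame
        if loss.item() <= loss_threshold:               # this atlas is already reproduced well enough
            continue
        optimizer.zero_grad()
        loss.backward()                                 # iteratively update weights in both models
        optimizer.step()
    return flow_model, synthesis_model                  # motion vector net + frame synthesis net
```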
- a corresponding device can be constructed by running, on a computer, an application program that implements the foregoing method embodiments. Please refer to FIG. 7 for details.
- a video frame insertion device includes: an acquisition module 2100 , an extraction module 2200 , a processing module 2300 and a reading module 2400 .
- the acquisition module 2100 is used to acquire the target video to be processed;
- the extraction module 2200 is used to extract the first reference frame image and the second reference frame image in the target video, wherein the first reference frame image and the The second reference frame images are adjacent on the time axis;
- the processing module 2300 is configured to combine the first reference frame image and the second reference frame image into a preset frame insertion model, wherein the frame insertion model is Pre-trained to convergence, a neural network model for performing frame insertion processing on the target frame image according to the motion vector between the first reference frame image and the second reference frame image;
- the reading module 2400 is used to read the frame-up image output by the frame insertion model, and insert the frame-up image between the first reference frame image and the second reference frame image.
- when the video frame insertion device determines the target video that needs frame insertion processing, it reads two adjacent frame images in the target video as reference frame images and extracts the motion vector between the two frame images from the two reference frame images. Since the motion vector can represent the transitional motion state between the two reference frame images, an up-frame image between the two reference frame images can be generated through the motion vector and the two reference frame images by the frame insertion model.
- the introduction of the motion vector can make the image of the up-frame image display the intermediate state between the two reference frame images, making the frame insertion result more natural, and greatly improving the user experience.
- the image standard is a frame rate threshold
- the apparatus for video frame insertion further includes: a first acquisition submodule, a first comparison submodule, and a first execution submodule.
- the first acquisition sub-module is used to acquire the frame rate data of the video to be played;
- the first comparison sub-module is used to compare the frame rate data with the frame rate threshold;
- the first execution sub-module is used to determine that the video to be played is the target video when the frame rate value represented by the frame rate data is smaller than the frame rate threshold.
- the frame insertion model includes a motion vector network model
- the video frame insertion device further includes: a first input sub-module, a first generation sub-module, a second generation sub-module and a first overlay sub-module.
- the first input sub-module is used to superimpose the channel images of the first reference frame image and the second reference frame image to generate a superimposed image and to input the superimposed image into the motion vector network model, wherein the motion vector network model is a convolutional neural network model that is pre-trained to convergence and used to extract motion vectors between images; the first generation sub-module is used for the motion vector network model to perform convolution pooling processing on the superimposed image through the convolution layer to generate down-sampling features; the second generation sub-module is used for the motion vector network model to perform interpolation processing on the down-sampling features through the deconvolution layer to generate up-sampling features; the first superposition sub-module is used to perform feature fusion and superposition on the down-sampling features and the up-sampling features to generate the motion vector.
- the frame insertion model includes a frame synthesis network model
- the video frame insertion apparatus further includes: a third generation sub-module, a second input sub-module, a fourth generation sub-module and a fifth generation sub-module.
- the third generation sub-module is used to perform interpolation processing on the motion vector to generate an intermediate frame image; the second input sub-module is used to combine the motion vector, the intermediate frame image, the first reference frame image and the second reference frame image and input them into the frame synthesis network model, wherein the frame synthesis network model is a convolutional neural network model that is pre-trained to a convergent state and used for image interpolation processing; the fourth generation sub-module is used for the frame synthesis network model to perform convolution processing on the motion vector, the intermediate frame image, the first reference frame image and the second reference frame image to generate a visible mask image; the fifth generation sub-module is used for the frame synthesis network model to perform interpolation processing on the visible mask image and the motion vector to generate the up-frame image.
- In some embodiments, the video frame insertion apparatus further includes: a first processing sub-module, a third input sub-module, a first reading sub-module, a fourth input sub-module, a second reading sub-module and a second execution sub-module.
- The first processing sub-module is used to frame the pre-collected sample video to generate a sample atlas, where the sample atlas includes a first training frame image, a second training frame image and a sample frame image, and the sample frame image lies within the time interval represented by the first training frame image and the second training frame image.
- The third input sub-module is used to input the first training frame image and the second training frame image into a preset first initial model, where the first initial model is a convolutional neural network model that has not yet been trained to convergence and is used to extract motion vectors between images; the first reading sub-module is used to read the training motion vector output by the first initial model and to generate a training intermediate frame image by interpolating the training motion vector.
- The fourth input sub-module is used to input the first training frame image, the second training frame image, the training motion vector and the training intermediate frame image into a preset second initial model, where the second initial model is a convolutional neural network model that has not yet been trained to convergence and is used to interpolate images; the second reading sub-module is used to read the training up-frame image output by the second initial model and to calculate, according to a preset loss function, the feature difference between the training up-frame image and the sample frame image; the second execution sub-module is used to iteratively update the weights of the first initial model and the second initial model based on the feature difference when the feature difference is greater than a preset loss threshold, until the feature difference is less than or equal to the loss threshold. A training sketch follows below.
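- The following sketch shows one supervised training step for the two-stage model; it assumes PyTorch, the hypothetical MotionVectorNet/FrameSynthesisNet classes sketched above, and helper callables `superimpose`, `warp` and `loss_fn` that stand in for the channel superposition, the flow-based interpolation and the loss function, so it is an illustration rather than the patented training procedure.

```python
# One supervised training step for the frame-insertion model, assuming PyTorch.
import torch

def train_step(mv_net, synth_net, optimizer, loss_fn, superimpose, warp,
               frame0, frame1, sample_frame):
    """`superimpose`, `warp` and `loss_fn` are assumed helper callables."""
    optimizer.zero_grad()
    flow = mv_net(superimpose(frame0, frame1))                # training motion vector
    intermediate = warp(frame0, flow)                         # coarse training intermediate frame
    up_frame = synth_net(flow, intermediate, frame0, frame1)  # training up-frame image
    loss = loss_fn(up_frame, sample_frame)                    # feature difference vs. sample frame
    loss.backward()                                           # iterate the weights of both models
    optimizer.step()
    return loss.item()

# The optimizer is assumed to cover the parameters of both initial models, e.g.:
# optimizer = torch.optim.Adam(list(mv_net.parameters()) + list(synth_net.parameters()), lr=1e-4)
```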
- In some embodiments, the video frame insertion apparatus further includes: a second processing sub-module and a third execution sub-module.
- The second processing sub-module is used to perform repeated, iterative supervised training of the frame insertion model with several sample atlases until the frame insertion model meets a preset convergence condition; the third execution sub-module is used to determine the first initial model trained to convergence as the motion vector network model and the second initial model trained to convergence as the frame synthesis network model.
- In some embodiments, the frame insertion model includes a loss function, and the loss function is a weighted combination of a reconstruction difference function and a motion-vector estimation restoration difference function.
- An embodiment of the present application further provides a computer device for running a computer program implementing the video frame insertion method. FIG. 8 is a block diagram of the basic structure of the computer device according to this embodiment.
- The computer device includes a processor, a non-volatile storage medium, a memory and a network interface connected through a system bus. The non-volatile storage medium of the computer device stores an operating system, a database and computer-readable instructions, and the database may store a sequence of control information. When the computer-readable instructions are executed by the processor, the processor implements a video frame insertion method.
- The processor of the computer device provides computing and control capabilities and supports the operation of the entire computer device. Computer-readable instructions may be stored in the memory of the computer device, and when executed by the processor, they cause the processor to execute a video frame insertion method. The network interface of the computer device is used to communicate with a connected terminal.
- Those skilled in the art can understand that FIG. 8 is only a block diagram of part of the structure related to the solution of the present application and does not constitute a limitation on the computer device to which the solution is applied; a specific computer device may include more or fewer components than shown in the figure, combine certain components, or have a different arrangement of components.
- In this embodiment, the processor is used to execute the specific functions of the acquisition module 2100, the extraction module 2200, the processing module 2300 and the reading module 2400 in FIG. 7, and the memory stores the program codes and the various data required to execute these modules. The network interface is used for data transmission to and from user terminals or servers. The memory in this embodiment stores the program codes and data required to execute all sub-modules of the video frame insertion device, and the server can call these program codes and data to execute the functions of all the sub-modules.
- When the computer device determines the target video that needs frame insertion processing, it reads two adjacent frame images in the target video as reference frame images and extracts the motion vector between them. Because the motion vector can represent the transitional motion state between the two reference frame images, an up-frame image lying between the two reference frame images can be generated from the motion vector and the two reference frame images by the frame insertion model.
- The introduction of the motion vector allows the picture of the up-frame image to display the intermediate state between the two reference frame images, which makes the frame insertion result more natural and greatly improves the user experience. A minimal end-to-end sketch follows below.
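- The following end-to-end sketch assumes a callable `interpolation_model` that maps two adjacent frames to an up-frame image (for example, the two hypothetical networks above chained together) and a list of decoded frames; the doubling loop and the stopping condition are illustrative assumptions rather than the claimed method.

```python
# End-to-end sketch: insert an up-frame image between every pair of adjacent frames,
# repeating until the frame rate reaches a target value.
def upsample_frame_rate(frames, fps, target_fps, interpolation_model):
    while fps < target_fps:
        upsampled = []
        for frame0, frame1 in zip(frames, frames[1:]):
            upsampled.append(frame0)
            upsampled.append(interpolation_model(frame0, frame1))  # insert between neighbours
        upsampled.append(frames[-1])
        frames, fps = upsampled, fps * 2   # each pass roughly doubles the frame rate
    return frames, fps
```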
- The present application also provides a non-volatile storage medium. The video frame insertion method is written as a computer program and stored in the storage medium in the form of computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the program runs on the computer and causes the one or more processors to execute the steps of the video frame insertion method of any of the foregoing embodiments.
- Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program; the computer program can be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The aforementioned storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc or a read-only memory (ROM), or a random access memory (RAM).
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Molecular Biology (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Image Analysis (AREA)
Abstract
A video frame insertion method, apparatus, computer device and storage medium, including: acquiring a target video to be processed; extracting a first reference frame image and a second reference frame image from the target video, where the first reference frame image and the second reference frame image are adjacent on the time axis; merging the first reference frame image and the second reference frame image and inputting them into a preset frame insertion model, where the frame insertion model is a neural network model pre-trained to convergence and used to perform frame insertion processing on the target video according to the motion vector between the first reference frame image and the second reference frame image; and reading the up-frame image output by the frame insertion model and inserting the up-frame image between the first reference frame image and the second reference frame image. The introduction of the motion vector allows the picture of the up-frame image to display the intermediate state between the two reference frame images, which makes the frame insertion result more natural and greatly improves the user experience.
Description
本申请要求于2020年12月29日提交中国专利局、申请号为202011603134.4、发明名称为“视频插帧方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
本发明实施例涉及视频处理领域,尤其是一种视频插帧方法、装置、计算机设备及存储介质。
帧率是以帧为单位的位图图像连续出现在显示器上的频率（速率）。帧率的多少直接影响到视频播放时的流畅度，帧率高的视频播放流畅度较好，反之，则越差，当帧率过低时视频播放就会出现卡顿现象。
本发明创造的发明人在研究中发现，在视频网络直播过程中，因用户网络状况多种多样，在不理想的情况下，需要降低传输视频的码率，而降低码率的方式包括：降低视频分辨率或者降低视频帧率，而现有技术中，为了保证视频画质往往对视频进行降帧处理，降低帧率则会降低视频播放流畅度，影响用户的观看体验。
发明内容
本发明实施例提供一种能够提升视频播放流畅度的视频插帧方法、装置、计算机设备及存储介质。
为解决上述技术问题,本发明创造的实施例采用的一个技术方案是:提供一种视频插帧方法,包括:
获取待处理的目标视频;
提取所述目标视频中的第一参考帧图和第二参考帧图,其中,所述第一参考帧图与所述第二参考帧图在时间轴上相邻;
将所述第一参考帧图和第二参考帧图合并输入至预设的插帧模型中,其中,所述插帧模型为预先训练至收敛,用于根据所述第一参考帧图和第 二参考帧图之间的运动向量对所述目标帧图,进行插帧处理的神经网络模型;
读取所述插帧模型输出的升帧图像,并将所述升帧图像插入至所述第一参考帧图和第二参考帧图之间。
可选地,所述图像标准为帧率阈值,所述获取待处理的目标视频包括:
获取待播放视频的帧率数据;
将所述帧率数据与所述帧率阈值进行比对;
当所述帧率数据表征的帧率值小于所述帧率阈值,确定所述待播放视频为所述目标视频。
可选地,所述插帧模型包括运动向量网络模型,所述将所述第一参考帧图和第二参考帧图合并输入至预设的插帧模型中包括:
将所述第一参考帧图和第二参考帧图进行通道图像叠加后生成叠加图像,并将所述叠加图像输入至所述运动向量网络模型中,其中,所述运动向量网络模型为预先训练至收敛,用于提取图像之间运动向量的卷积神经网络模型;
所述运动向量网络模型通过卷积层对所述叠加图像进行卷积池化处理,生成下采样特征;
所述运动向量网络模型通过反卷积层对所述下采样特征进行插值处理,生成上采样特征;
将所述下采样特征和所述上采样特征进行特征融合叠加生成所述运动向量。
可选地,所述插帧模型包括帧合成网络模型,所述将所述第一参考帧图和第二参考帧图合并输入至预设的插帧模型中包括:
对所述运动向量进行插值处理生成中间帧图;
将所述运动向量、中间帧图、第一参考帧图和第二参考帧图合并输入至所述帧合成网络模型中,其中,所述帧合成网络模型为预先训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型;
所述帧合成网络模型对所述运动向量、中间帧图、第一参考帧图和第二参考帧图进行卷积处理,生成可见掩码图;
所述帧合成网络模型对所述可见掩码图和所述运动向量进行插值处理,生成所述升帧图像。
可选地,所述插帧模型的训练方法包括:
对预先采集的样本视频进行帧化处理生成样本图集,其中,所述样本图集包括:第一训练帧图、第二训练帧图和样本帧图,所述样本帧图位于所述第一训练帧图和第二训练帧图表征的时间区间内;
将所述第一训练帧图和第二训练帧图输入至预设的第一初始模型中,其中,所述第一初始模型为尚未训练至收敛状态,用于提取图像之间运动向量的卷积神经网络模型;
读取所述第一初始模型输出的训练运动向量,并根据训练运动向量的插值生成训练中间帧图;
将所述第一训练帧图、第二训练帧图、训练运动向量和训练中间帧图输入至预设的第二初始模型中,其中,所述第二初始模型为尚未训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型;
读取所述第二初始模型输出的训练升帧图像,并根据预设的损失函数计算所述训练升帧图像与所述样本帧图之间的特征差值;
当所述特征差值大于预设的损失阈值,基于所述特征差值对所述第一初始模型和所述第二初始模型中的权重值进行反复迭代更新,直至所述特征差值小于等于所述损失阈值为止。
可选地,所述基于所述特征差值对所述第一初始模型和所述第二初始模型中的权重值进行反复迭代更新,直至所述特征差值小于等于所述损失阈值为止之后包括:
通过若干所述样本图集对所述插帧模型进行反复迭代的监督训练,直至所述插帧模型符合预设的收敛条件为止;
确定训练至收敛状态的所述第一初始模型为所述运动向量网络模型,所述第二初始模型为所述帧合成网络模型。
可选地,所述插帧模型包括损失函数,所述损失函数由重构差异函数和运动向量估计还原差异函数加权组成。
为解决上述技术问题,本发明实施例还提供一种视频插帧装置,包括:
获取模块,用于获取待处理的目标视频;
提取模块,用于提取所述目标视频中的第一参考帧图和第二参考帧图,其中,所述第一参考帧图与所述第二参考帧图在时间轴上相邻;
处理模块,用于将所述第一参考帧图和第二参考帧图合并输入至预设的插帧模型中,其中,所述插帧模型为预先训练至收敛,用于根据所述第一参考帧图和第二参考帧图之间的运动向量对所述目标帧图,进行插帧处理的神经网络模型;
读取模块,用于读取所述插帧模型输出的升帧图像,并将所述升帧图像插入至所述第一参考帧图和第二参考帧图之间。
可选地,所述图像标准为帧率阈值,所述视频插帧装置还包括:
第一获取子模块,用于获取待播放视频的帧率数据;
第一比对子模块,用于将所述帧率数据与所述帧率阈值进行比对;
第一执行子模块,用于当所述帧率数据表征的帧率值小于所述帧率阈值,确定所述待播放视频为所述目标视频。
可选地,所述插帧模型包括运动向量网络模型,所述视频插帧装置还包括:
第一输入子模块,用于将所述第一参考帧图和第二参考帧图进行通道图像叠加后生成叠加图像,并将所述叠加图像输入至所述运动向量网络模型中,其中,所述运动向量网络模型为预先训练至收敛,用于提取图像之间运动向量的卷积神经网络模型;
第一生成子模块,用于所述运动向量网络模型通过卷积层对所述叠加图像进行卷积池化处理,生成下采样特征;
第二生成子模块,用于所述运动向量网络模型通过反卷积层对所述下采样特征进行插值处理,生成上采样特征;
第一叠加子模块,用于将所述下采样特征和所述上采样特征进行特征融合叠加生成所述运动向量。
可选地,所述插帧模型包括帧合成网络模型,所述视频插帧装置还包括:
第三生成子模块,用于对所述运动向量进行插值处理生成中间帧图;
第二输入子模块,用于将所述运动向量、中间帧图、第一参考帧图和第二参考帧图合并输入至所述帧合成网络模型中,其中,所述帧合成网络模型为预先训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型;
第四生成子模块,用于所述帧合成网络模型对所述运动向量、中间帧图、第一参考帧图和第二参考帧图进行卷积处理,生成可见掩码图;
第五生成子模块,用于所述帧合成网络模型对所述可见掩码图和所述运动向量进行插值处理,生成所述升帧图像。
可选地,所述视频插帧装置还包括:
第一处理子模块,用于对预先采集的样本视频进行帧化处理生成样本图集,其中,所述样本图集包括:第一训练帧图、第二训练帧图和样本帧图,所述样本帧图位于所述第一训练帧图和第二训练帧图表征的时间区间内;
第三输入子模块,用于将所述第一训练帧图和第二训练帧图输入至预设的第一初始模型中,其中,所述第一初始模型为尚未训练至收敛状态,用于提取图像之间运动向量的卷积神经网络模型;
第一读取子模块,用于读取所述第一初始模型输出的训练运动向量,并根据训练运动向量的插值生成训练中间帧图;
第四输入子模块,用于将所述第一训练帧图、第二训练帧图、训练运动向量和训练中间帧图输入至预设的第二初始模型中,其中,所述第二初始模型为尚未训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型;
第二读取子模块,用于读取所述第二初始模型输出的训练升帧图像,并根据预设的损失函数计算所述训练升帧图像与所述样本帧图之间的特征差值;
第二执行子模块,用于当所述特征差值大于预设的损失阈值,基于所述特征差值对所述第一初始模型和所述第二初始模型中的权重值进行反复迭代更新,直至所述特征差值小于等于所述损失阈值为止。
可选地,所述视频插帧装置还包括:
第二处理子模块,用于通过若干所述样本图集对所述插帧模型进行反复迭代的监督训练,直至所述插帧模型符合预设的收敛条件为止;
第三执行子模块,用于确定训练至收敛状态的所述第一初始模型为所述运动向量网络模型,所述第二初始模型为所述帧合成网络模型。
可选地,所述插帧模型包括损失函数,所述损失函数由重构差异函数和运动向量估计还原差异函数加权组成。
为解决上述技术问题本发明实施例还提供一种计算机设备，包括存储器和处理器，所述存储器中存储有计算机可读指令，所述计算机可读指令被所述处理器执行时，使得所述处理器执行上述视频插帧方法的步骤。
为解决上述技术问题本发明实施例还提供一种存储有计算机可读指令的存储介质，所述计算机可读指令被一个或多个处理器执行时，使得一个或多个处理器执行上述视频插帧方法的步骤。
本发明实施例的有益效果是:当确定需要进行插帧处理的目标视频时,读取目标视频中两个相邻的帧图像作为参考帧图,通过两张参考帧图提取两张帧图之间的运动向量,由于,运动向量能够表征两张参考帧图之间的过渡运动状态,因此,通过运动向量以及两张参考帧图插帧模型就能够生成,介于两张参考图之间的升帧图像。运动向量的引入,能够使升帧图像的图像画面显示两张参考帧图之间的中间状态,使插帧结果更加自然,极大的提高了用户体验。
本申请上述的和/或附加的方面和优点从下面结合附图对实施例的描述中将变得明显和容易理解,其中:
图1为本申请一个具体实施例的视频插帧方法基本流程示意图;
图2为本申请一个具体实施例的筛选目标视频的流程示意图;
图3为本申请一具体实施例的提取运动向量的流程示意图;
图4为本申请一个具体实施例的生成升帧图像的第二种实施方式流程示意图;
图5为本申请一个具体实施例的训练插帧模型单一流程的流程示意 图;
图6为本申请一个具体实施例的训练插帧模型整流程的流程示意图;
图7为本申请一个实施例的视频插帧装置基本结构示意图;
图8为本申请一个实施例的计算机设备的基本结构框图。
下面详细描述本申请的实施例,所述实施例的示例在附图中示出,其中自始至终相同或类似的标号表示相同或类似的元件或具有相同或类似功能的元件。下面通过参考附图描述的实施例是示例性的,仅用于解释本申请,而不能解释为对本申请的限制。
本技术领域技术人员可以理解,除非特意声明,这里使用的单数形式“一”、“一个”、“所述”和“该”也可包括复数形式。应该进一步理解的是,本申请的说明书中使用的措辞“包括”是指存在所述特征、整数、步骤、操作、元件和/或组件,但是并不排除存在或添加一个或多个其他特征、整数、步骤、操作、元件、组件和/或它们的组。
本技术领域技术人员可以理解,除非另外定义,这里使用的所有术语(包括技术术语和科学术语),具有与本申请所属领域中的普通技术人员的一般理解相同的意义。还应该理解的是,诸如通用字典中定义的那些术语,应该被理解为具有与现有技术的上下文中的意义一致的意义,并且除非像这里一样被特定定义,否则不会用理想化或过于正式的含义来解释。
本技术领域技术人员可以理解,这里所使用的“终端”既包括无线信号接收器的设备,其仅具备无发射能力的无线信号接收器的设备,又包括接收和发射硬件的设备,其具有能够在双向通信链路上,执行双向通信的接收和发射硬件的设备。这种设备可以包括:蜂窝或其他通信设备,其具有单线路显示器或多线路显示器或没有多线路显示器的蜂窝或其他通信设备;PCS(Personal Communications Service,个人通信系统),其可以组合语音、数据处理、传真和/或数据通信能力;PDA(Personal Digital Assistant,个人数字助理),其可以包括射频接收器、寻呼机、互联网/内联网访问、网络浏览器、记事本、日历和/或GPS(Global Positioning System,全球定 位系统)接收器;常规膝上型和/或掌上型计算机或其他设备,其具有和/或包括射频接收器的常规膝上型和/或掌上型计算机或其他设备。这里所使用的“终端”可以是便携式、可运输、安装在交通工具(航空、海运和/或陆地)中的,或者适合于和/或配置为在本地运行,和/或以分布形式,运行在地球和/或空间的任何其他位置运行。这里所使用的“终端”还可以是通信终端、上网终端、音乐/视频播放终端,例如可以是PDA、MID(Mobile Internet Device,移动互联网设备)和/或具有音乐/视频播放功能的移动电话,也可以是智能电视、机顶盒等设备。
请参阅图1,图1为本实施例视频插帧方法基本流程示意图。
如图1所示,视频插帧方法包括:
S1100、获取待处理的目标视频;
本实施方式中的目标视频是指被选定用于进行插帧处理,提升视频帧率的待处理视频。
目标视频能够为通过服务器端发送至终端中的网络视频,也能够是存储在终端本地的本地视频。根据具体实施方式的不同,在一些实施方式中,本实施方式中的视频插帧方法还能够被用于处理终端上传的视频数据,此时,目标视频即为终端上传的视频。
目标视频的取得需要进行筛选,筛选的方式主要包括:通过码率或者帧率进行筛选。具体地,当目标视频为网络传输视频时,终端接收到服务器端发送的视频数据后,读取网络端口该视频数据的码率,当码率低于预设的码率阈值时,确定该视频数据为目标视频。当视频为本地视频时,终端读取该视频的帧率参数,当帧率参数表征的数值小于帧率阈值时,确定该视频数据为目标视频。在一些实施方式中,视频插帧方法被用于处理终端上传的视频数据时,服务器端读取终端上传数据的码率,当码率低于预设的码率阈值时,确定该上传视频数据为目标视频。
S1200、提取所述目标视频中的第一参考帧图和第二参考帧图,其中,所述第一参考帧图与所述第二参考帧图在时间轴上相邻;
当确定目标视频后,提取目标视频中的相邻的两张帧图,定义这两张帧图为第一参考帧图和第二参考帧图,第一参考帧图和第二参考帧图在时 间轴上相邻。
第一参考帧图和和第二参考帧图的获取能够采用随机抽取的方式进行采集,例如,将目标视频进行帧化处理,使目标视频转化为沿时间轴排布的多张帧图,然后,在多张帧图中通过随机算法抽取一张图片作为第一参考帧图,选取第一参考帧图之前或者之后相邻的一张帧图作为第二参考帧图。
在一些实施方式中,为了使插帧后的视频播放更加的流畅,第一参考帧图和第二参考帧图的选取需要考虑场景转换的需求。当视频中场景转场中如果没有位于中间态的过渡场景,转场的过程就回显得深硬不够自然。在选取第一参考帧图和第二参考帧图时,将采集到的相邻帧图输入至转场分类模型中,转场分类模型为通过监督训练,能够对两张图片是否属于转场图像的神经网络模型,此处,转场分类模型能够由卷积神经网络模型、深度卷积神经网络模型和循环神经网络模型或者上述模型的变种模型训练得到。将第一参考帧图和第二参考帧图限定为转场图像,然后,对转场图像进行插帧处理,能够更进一步的提升视频播放流畅度。
S1300、将所述第一参考帧图和第二参考帧图合并输入至预设的插帧模型中,其中,所述插帧模型为预先训练至收敛,用于根据所述第一参考帧图和第二参考帧图之间的运动向量对所述目标视频进行插帧处理的神经网络模型;
采集得到第一参考帧图和第二参考帧图,将第一参考帧图和第二参考帧图合并输入至插帧模型中。
本实施方式中，插帧模型用于根据输入的两张图像之间的运动向量对目标视频进行插帧处理，其中，插帧模型预先训练至收敛状态，因此，能够准确地对目标视频进行插帧。
具体地,将第一参考帧图和第二参考帧图进行像素叠加,像素叠加的时候,第一参考帧图和第二参考帧图的图像尺寸调整一致,将两张参考帧图按RGB颜色分别拆分成三个颜色通道,分别为红色、绿色和蓝色通道,然后,以通道颜色为类别,将同类别中的图像进行加权叠加,三个通道分别叠加后,将叠加后的三个通道图像进行合并生成叠加图像。
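A minimal sketch of the channel-image superposition just described is given below; it assumes NumPy arrays in H×W×3 RGB layout and an equal 0.5/0.5 weighting, both of which are illustrative assumptions.

```python
# Sketch of the channel-wise weighted superposition of two reference frames, assuming
# NumPy and H x W x 3 RGB images: each colour channel is overlaid with a weighted sum
# and the three overlaid channels are merged back into one superimposed image.
import numpy as np

def superimpose(frame0: np.ndarray, frame1: np.ndarray, w0: float = 0.5, w1: float = 0.5):
    assert frame0.shape == frame1.shape, "frames must first be resized to the same size"
    channels = []
    for c in range(3):                                   # R, G, B channels in turn
        channels.append(w0 * frame0[:, :, c] + w1 * frame1[:, :, c])
    return np.stack(channels, axis=-1)                   # merged superimposed image

blended = superimpose(np.zeros((256, 256, 3)), np.ones((256, 256, 3)))
print(blended.shape, blended[0, 0])  # (256, 256, 3) [0.5 0.5 0.5]
```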
将合并后的叠加图像输入至插帧模型中,由于,插帧模型被训练用于提取第一参考帧图和第二参考帧图之间的运动向量,因此,叠加图像通过插帧模型卷积层进行特征提取后,得到第一参考帧图和第二参考帧图之间的运动向量,运动向量表征第一参考帧图和第二参考帧图之间的变化态,因此,插帧模型根据运动向量表征的数值,对运动向量进行像素话后,就能够生成升帧图像。
在一些实施方式中,插帧模型为联合模型,由运动向量网络模型和帧合成网络模型组成,其中,运动向量网络模型为预先训练至收敛,用于提取图像之间运动向量的卷积神经网络模型,帧合成网络模型为预先训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型。运动向量网络模型提取运动向量后,将运动向量、第一参考帧图、第二参考帧图和由运动向量生成的较为粗糙的中间帧图作为入参,继续进行特征提取形成可见掩码图,最后,根据可见掩码图各运动向量生成更加精细的升帧图像。
S1400、读取所述插帧模型输出的升帧图像,并将所述升帧图像插入至所述第一参考帧图和第二参考帧图之间。
通过插帧模型输出升帧图像后,读取已经生成的升帧图像,并将升帧图像插入到第一参考帧图和第二参考帧图之间完成一个升帧步骤。然后,继续重复S1100-S1400的过程,直至目标视频的码率或者帧率达到设定的码率阈值或者帧率阈值后,结束对目标视频的插帧操作。
上述实施方式,当确定需要进行插帧处理的目标视频时,读取目标视频中两个相邻的帧图像作为参考帧图,通过两张参考帧图提取两张帧图之间的运动向量,由于,运动向量能够表征两张参考帧图之间的过渡运动状态,因此,通过运动向量以及两张参考帧图插帧模型就能够生成,介于两张参考图之间的升帧图像。运动向量的引入,能够使升帧图像的图像画面显示两张参考帧图之间的中间状态,使插帧结果更加自然,极大的提高了用户体验。
在一些实施方式中,目标视频的确定需要通过视频的帧率进行筛选。请参阅图2,图2为本实施例筛选目标视频的流程示意图。
如图2所示,S1100之前包括:
S1111、获取待播放视频的帧率数据;
用户终端通过指令对待播放的视频进行播放时,读取该待播放视频的帧率数据。
本实施方式中的待播放视频包括由服务器端发送的网络视频,以及存储在用户终端本地存储空间内的本地视频。
S1112、将所述帧率数据与所述帧率阈值进行比对;
将获取到的帧率数据与预设的帧率阈值进行比对,其中,帧率阈值的数值设定能够根据视频播放帧率的最低标准设定,也能够根据待播放视频的原视频帧率进行设定,例如,当服务器向用户终端发送视频数据时,将服务器端视频数据的帧率数据发送至用户终端,用户终端接收到服务器端发送的帧率数据后,将该帧率数据设定为帧率阈值。
S1113、当所述帧率数据表征的帧率值小于所述帧率阈值,确定所述待播放视频为所述目标视频。
当帧率数据表征的帧率值小于帧率阈时,则确定待播放视频为需要进行插帧操作的目标视频。当帧率数据表征的帧率值大于等于帧率阈时,则确定该待播放视频无需进行插值处理。
在一些实施方式中,当播放视频中出现卡顿时,截取卡顿视频所在的时间段的视频为目标视频并对目标视频进行插帧处理,进而消除视频卡顿现象。
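A small sketch of the frame-rate screening step (S1111-S1113) follows; it assumes the frame rate can be read from the container metadata, and the threshold value is an illustrative choice.

```python
# Frame-rate screening sketch: a video whose frame rate is below the threshold is
# selected as the target video for frame insertion.
def is_target_video(video_fps: float, fps_threshold: float = 30.0) -> bool:
    return video_fps < fps_threshold

print(is_target_video(24.0))  # True  -> needs frame insertion
print(is_target_video(60.0))  # False -> no interpolation required
```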
在一些实施方式中,插帧模型包括运动向量网络模型,运动向量网络模型用于提取第一参考帧图和第二参考帧图的运动向量。请参阅图3,图3为本实施例提取运动向量的流程示意图。
如图3所示,S1300包括:
S1311、将所述第一参考帧图和第二参考帧图进行通道图像叠加后生成叠加图像,并将所述叠加图像输入至所述运动向量网络模型中,其中,所述运动向量网络模型为预先训练至收敛,用于提取图像之间运动向量的卷积神经网络模型;
将第一参考帧图和第二参考帧图进行像素叠加,像素叠加的时候,第一参考帧图和第二参考帧图的图像尺寸调整一致,将两张参考帧图按RGB 颜色分别拆分成三个颜色通道,分别为红色、绿色和蓝色通道,然后,以通道颜色为类别,将同类别中的图像进行加权叠加,三个通道分别叠加后,将叠加后的三个通道图像进行合并生成叠加图像。
将叠加图像输入到运动向量网络模型中,运动向量网络模型为预先训练至收敛,用于提取图像之间运动向量的卷积神经网络模型。
在一些实施方式中,运动向量网络模型采用的模型为:U-net网络模型。U-net网络结构包括两个对称部分:前面一部分网络与普通卷积网络相同,使用了3x3的卷积和池化下采样,能够抓住图像中的上下文信息;后面部分网络则是与前面基本对称,使用的是3x3反卷积层和上采样,以达到输出图像分割的目的。此外,网络中还用到了特征融合,将前面部分下采样网络的特征与后面上采样部分的特征进行了融合以获得更准确的上下文信息,达到更好的分割效果。在一些实施方式中,运动向量网络模型还能够为U2-net网络模型。
在一些实施方式中,运动向量网络模型采用的模型还能够为(不限于):卷积神经网络模型、深度卷积神经网络模型、循环神经网络模型或者上述神经网络模型的变种模型。
S1312、所述运动向量网络模型通过卷积层对所述叠加图像进行卷积池化处理,生成下采样特征;
叠加图像被输入至运动向量网络模型中后,运动向量网络模型中的卷积层对叠加图像进行卷积和池化处理,提取叠加图像中的下采样特征,上述这个流程成为对叠加图像进行下采样,下采样的过程中,运动向量网络模型对折叠图像进行特征提取和图像缩放。
S1313、所述运动向量网络模型通过反卷积层对所述下采样特征进行插值处理,生成上采样特征;
通过卷积层对叠加图像进行特征提取和缩小后,运动向量网络模型通过与卷积层对对称的反卷积层对缩小后的图像进行插值处理,插值处理的过程中同时提取叠加图像的上采样特征,上述这个处理过程为上采样,上采样的过程中通过插值处理的方式提取图像特征并放大被缩小的叠加图像。
S1314、将所述下采样特征和所述上采样特征进行特征融合叠加生成所述运动向量。
运动向量网络模型在经过卷积和反卷积处理后,生成叠加图像的下采样特征和上采样特征,然后,对下采样特征和上采样特征进行融合叠加,融合叠加的过程就是对卷积和反卷积图像进行对应的特征进行加权得到一个融合后的运动向量。
具体地,运动向量网络模型包括:第一卷积层、第二卷积层、第三卷积层、第一反卷积层、第二反卷积层和第三反卷积层。其中,第一卷积层与第一反卷积层相互对称,第二卷积层与第二反卷积层相互对称,第三卷积层与第三反卷积层相互对称。第一卷积层对叠加图像进行特征提取后,将提取的特征同步至第二卷基层和第一反卷积层中,第二卷积层进行特征提取后,将提取的特征同步到第三卷积层和第二反卷积层,以此类推,叠加图像经过一个“U”形卷积层提取路径后,最终由第三反卷积层输出运动向量。在这个过程中,第一反卷积层、第二反卷积层和第三反卷积层进行特征提取的过程中,既能够接收由上一级卷积层同步的特征,又能够接收由与之对应的卷积层同步的特征,因此,下采样网络的特征与后面上采样部分的特征进行了融合以获得更准确的上下文信息。
运动向量网络模型在得到第一参考帧图和第二参考帧图的运动向量后,将运动向量中的向量值进行像素化,生成了第一参考帧图和第二参考帧图的升帧图像。
在一些实施方式中,为了进一步的提高升帧图像的准确度,需要进一步的对运动向量进行处理。请参阅图4,图4为本实施例生成升帧图像的第二种实施方式流程示意图。
如图4所示,S1314之后,包括:
S1321、对所述运动向量进行插值处理生成中间帧图;
本实施方式中,插帧模型为联合模型,由运动向量网络模型和帧合成网络模型组成,其中,运动向量网络模型为预先训练至收敛,用于提取图像之间运动向量的卷积神经网络模型,帧合成网络模型为预先训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型。运动向量网络模型 的输出连接至帧合成网络模型的一个输入通道中。
运动向量网络模型在得到第一参考帧图和第二参考帧图的运动向量后,将运动向量中的向量值进行像素化,生成了较为粗糙的中间帧图,中间帧图也能够作为第一参考帧图和第二参考帧图的升帧图像使用。
S1322、将所述运动向量、中间帧图、第一参考帧图和第二参考帧图合并输入至所述帧合成网络模型中,其中,所述帧合成网络模型为预先训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型;
将运动向量、中间帧图、第一参考帧图和第二参考帧图合并,合并的方式为:将相同大小的四张图片对应各点像素值进行加权,生成新的像素值,然后由新的像素值组成合并图像。但是,图像合并输入的方式不局限于此,在一些实施方式中,合并输入能够是将运动向量、中间帧图、第一参考帧图和第二参考帧图进行拼接后输入。
帧合成网络模型(不限于):卷积神经网络模型、深度卷积神经网络模型、循环神经网络模型或者上述神经网络模型的变种模型。
S1323、所述帧合成网络模型对所述运动向量、中间帧图、第一参考帧图和第二参考帧图进行卷积处理,生成可见掩码图;
帧合成网络模型对运动向量、中间帧图、第一参考帧图和第二参考帧图进行卷积处理,生成可见掩码图。
可见掩码图是一个范围0-1的alpha(αChannel,阿尔法通道)值,0代表生成帧当前位置的点复用第一参考帧图的当前位置的值,1代表当生成当前位置复用第二参考帧图的当前位置的值,中间数值代表两帧内容的融合。
S1324、所述帧合成网络模型对所述可见掩码图和所述运动向量进行插值处理,生成所述升帧图像。
帧合成网络模型对可见掩码图和运动向量进行插值处理,插值处理是指给定一个像素点,根据它周围像素点的信息来对该像素点的值进行预测。通过可见掩码图和运动向量进行插值处理,能够合成介于第一参考帧图和第二参考帧图之间中间态的升帧图像。
插值处理采用的技术方案包括(不限于):最近邻法、线性插值法、双 线性插值法或双三次插值法等。
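The sketch below illustrates bilinear interpolation driven by a motion vector in the form of backward warping; it assumes PyTorch, where `torch.nn.functional.grid_sample` performs the bilinear sampling from the four surrounding pixels, and the function name and flow convention are illustrative assumptions.

```python
# Bilinear backward warping of a frame by a motion vector, assuming PyTorch.
import torch
import torch.nn.functional as F

def backward_warp(frame, flow):
    """frame: (N, C, H, W); flow: (N, 2, H, W) displacement in pixels."""
    n, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().unsqueeze(0).expand(n, -1, -1, -1)
    coords = base + flow                                   # where each output pixel samples from
    # Normalise to [-1, 1] as required by grid_sample, then sample bilinearly.
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)       # (N, H, W, 2)
    return F.grid_sample(frame, grid, mode="bilinear", align_corners=True)

# Zero flow returns the input frame unchanged.
warped = backward_warp(torch.rand(1, 3, 64, 64), torch.zeros(1, 2, 64, 64))
```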
在一些实施方式中,需要将插帧模型训练至收敛状态。请参阅图5,图5为本实施例训练插帧模型单一流程的流程示意图。
如图5所示,插帧模型的训练方法如下:
S2111、对预先采集的样本视频进行帧化处理生成样本图集,其中,所述样本图集包括:第一训练帧图、第二训练帧图和样本帧图,所述样本帧图位于所述第一训练帧图和第二训练帧图表征的时间区间内;
在进行插帧模型训练时,首先应当准备用于模型训练的样本。本实施方式中,训练样本的准备过程如下:采集用于进行模型训练的样本视频,将样本视频进行帧化处理,帧化处理就是将样本视频拆分成按时间轴排布的若干帧图。将帧化处理后的序列帧图,按每5张为一个样本集进行打包,每一个打包数据我们称之为一个样本图集。但是,样本图集的组成不局限于此,根据具体应用场景的不同,在一些实施方式中,将序列帧图中连续3张、4张、6张或者更多张的帧图打包成样本图集。
样本图集中包括:第一训练帧图、第二训练帧图和样本帧图,其中,样本帧图位于第一训练帧图和第二训练帧图表征的时间区间内。具体地,将样本图集中位于第一序列和最后序列的帧图选为第一训练帧图和第二训练帧图,在剩余的帧图中随机选择一张帧图作为样本帧图。
例如,在一些实施方式中对样本视频的原始帧进行提取,然后按照视频播放的序列顺序存放,对提取出来图像进行缩放到分辨率为宽为256像素,高为256像素值,最后将这些序列图像按照5帧(Frame0,Frame1,Frame2,Frame3,Frame4)一组进行打包处理,在训练过程中,可以任意选取中间1帧(Frame1,Frame2,Frame3)作为样本帧图,Frame0和Frame4分别作为第一训练帧图和第二训练帧图。
在一些实施方式中,为了增强插帧模型的鲁棒性,需要对第一训练帧图和第二训练帧图进行图像增强处理,增强处理的方式包括对第一训练帧图和第二训练帧图进行(不限于):随机裁剪、方向的随机旋转和添加随机噪声等操作。
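The sketch below packs decoded sample frames into five-frame atlases and picks training triplets in the way described above; it assumes the video has already been decoded into an ordered list of frames, and the function names and the noise-only augmentation are illustrative assumptions.

```python
# Sample-atlas packing and triplet selection sketch, assuming frames are NumPy arrays.
import random
import numpy as np

def build_sample_atlases(frames, group_size=5):
    """Pack consecutive frames into atlases of `group_size` frames each."""
    return [frames[i:i + group_size]
            for i in range(0, len(frames) - group_size + 1, group_size)]

def pick_training_triplet(atlas):
    """Frame0/Frame4 are the training frames; a random middle frame is the sample frame."""
    return atlas[0], atlas[-1], random.choice(atlas[1:-1])

def augment(image):
    """Very small augmentation sketch: additive Gaussian noise only."""
    return image + np.random.normal(scale=2.0, size=image.shape)
```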
S2112、将所述第一训练帧图和第二训练帧图输入至预设的第一初始模 型中,其中,所述第一初始模型为尚未训练至收敛状态,用于提取图像之间运动向量的卷积神经网络模型;
使用样本图集进行模型训练时,将第一训练帧图和第二训练帧图叠加输入至第一初始模型中,图像叠加是指将第一训练帧图和第二训练帧图对应点的像素点进行加权运算。
将合并后的第一训练帧图和第二训练帧图输入至第一初始模型中。第一初始模型为运动向量网络模型的未收敛状态,同样是提取图像之间运动向量的卷积神经网络模型。
第一初始模型能够为(不限于):U-net网络模型、U2-net网络模型、卷积神经网络模型、深度卷积神经网络模型、循环神经网络模型或者上述神经网络模型的变种模型。
S2113、读取所述第一初始模型输出的训练运动向量,并根据训练运动向量的插值生成训练中间帧图;
读取第一初始模型输出的训练运动向量,需要指出的是,由于,第一初始模型尚未训练至收敛状态,因此,其输出的训练运动向量随机性较强,准确性较差。但是随着训练的进行,第一初始模型慢慢趋向于收敛时,输出的训练运动向量准确性会越来越高。
将第一初始模型输出的训练运动向量表征的各向量值进行像素画,生成训练中间帧图。
S2114、将所述第一训练帧图、第二训练帧图、训练运动向量和训练中间帧图输入至预设的第二初始模型中,其中,所述第二初始模型为尚未训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型;
通过第一初始模型得到训练运动向量和训练中间帧图后,将第一训练帧图、第二训练帧图、训练运动向量和训练中间帧图输入至第二初始模型中。
第二初始模型为帧合成网络模型的未收敛状态模型,同样属于用于对图像进行插值处理的卷积神经网络模型。
第二初始模型包括(不限于):卷积神经网络模型、深度卷积神经网络模型、循环神经网络模型或者上述神经网络模型的变种模型。
第一训练帧图、第二训练帧图、训练运动向量和训练中间帧图进行合并的方式为:将相同大小的四张图片对应各点像素值进行加权,生成新的像素值,然后由新的像素值组成合并图像。
第二初始模型通过将合并图像进行卷积后,生成训练可见掩码图,然后,对训练可见掩码图和训练运动向量进行插值处理,插值处理采用的技术方案包括(不限于):最近邻法、线性插值法、双线性插值法或双三次插值法等。插值处理后生成的图像就是训练升帧图像。
S2115、读取所述第二初始模型输出的训练升帧图像,并根据预设的损失函数计算所述训练升帧图像与所述样本帧图之间的特征差值;
读取由第二初始模型输出的训练升帧图像,需要指出的是,由于,第二初始模型尚未训练至收敛状态,因此,其输出的升帧图像的随机性较强,准确性较差。但是,随着训练的进行,第二初始模型慢慢趋向于收敛时,输出的训练升帧图像的准确性越来越高。
读取训练升帧图像后,使用损失函数将其与样本帧图进行比对,通过损失函数计算训练升帧图像和样本帧图之间的特征差值。
本实施方式中,将样本帧图直接作为标注图像使用,免去了监督训练过程中标注图像的流程,简化了插帧模型的训练流程,提高了训练的效率。
本实施方式中,损失函数为复合损失函数,具体地,损失函数由重构差异函数和运动向量估计还原差异函数加权组成。损失函数的特征具体描述为:
loss = α·l_r + β·l_w
其中，
l_w = ||I_0 − g(I_1, F_0→1)||_1 + ||I_1 − g(I_0, F_1→0)||_1
l_r 表示样本帧图与训练升帧图像之间的重构差异，l_w 表示样本帧图与训练升帧图像之间的运动向量估计还原差异，α 和 β 为参数值，N 表示批大小，I_ti 表示样本帧图，Î_ti 表示训练升帧图像，I_0 表示第一训练帧图，I_1 表示第二训练帧图，F_0→1 表示 I_0 到 I_1 之间的运动向量，F_1→0 表示 I_1 到 I_0 之间的运动向量，g 表示后向还原网络采样函数，可以通过运动向量和前面一帧，还原后面一帧的内容。
本实施方式中,损失函数不局限于上述例举的损失函数类型,根据具体应用场景的不同,损失函数包括(不限于):绝对值损失函数、log对数损失函数、平方损失函数、指数损失函数、Hinge损失函数、感知损失函数、交叉熵损失函数中的一种或者两种以上函数的组成的复合函数。
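The sketch below assembles the composite loss described above in PyTorch; the reconstruction term l_r is taken as a mean L1 difference because its exact formula is not reproduced in the text, mean L1 is likewise used in place of the plain L1 norm in l_w for scale stability, and `backward_warp` stands in for the backward restoration sampling function g — all of these are assumptions for illustration.

```python
# Composite frame-insertion loss sketch: weighted sum of a reconstruction difference
# and a motion-vector estimation restoration difference.
import torch.nn.functional as F

def frame_insertion_loss(pred, sample, frame0, frame1, flow_0to1, flow_1to0,
                         backward_warp, alpha=1.0, beta=1.0):
    l_r = F.l1_loss(pred, sample)                               # reconstruction difference
    l_w = (F.l1_loss(frame0, backward_warp(frame1, flow_0to1))  # restoration difference
           + F.l1_loss(frame1, backward_warp(frame0, flow_1to0)))
    return alpha * l_r + beta * l_w
```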
S2116、当所述特征差值大于预设的损失阈值,基于所述特征差值对所述第一初始模型和所述第二初始模型中的权重值进行反复迭代更新,直至所述特征差值小于等于所述损失阈值为止。
通过损失函数计算出样本帧图与训练升帧图像之间的特征差值,大于预设的损失阈值后,需要通过回传函数基于特征差值对第一初始模型和第二初始模型中的权重值进行校正,以使校正后第一初始模型和第二初始模型联合输出的训练升帧图像与样本帧图之间的特征差值趋向于小于等于损失阈值。
通过多次的迭代更新,反复的执行S2112-S2116之间的流程,当损失函数计算出样本帧图与训练升帧图像之间的特征差值,小于等于损失阈值后,完成对样本图集的训练。
对于插帧模型的训练是需要大量的样本图集进行训练的,训练的过程就是采用同的样本图集反复迭代的执行S2111-S2116之间的流程,直至插帧模型达到设定的收敛条件后为止。请参阅图6,图6为本实施例训练插帧模型整流程的流程示意图。
如图6所示,S2116之后包括:
S2120、通过若干所述样本图集对所述插帧模型进行反复迭代的监督训练,直至所述插帧模型符合预设的收敛条件为止;
对于插帧模型的训练需要大量的样本图集进行训练,具体地,使用不同的样本图集反复迭代的执行S2111-S2116之间的流程,每一轮训练都用于校正插帧模型中的权重参数,使插帧模型输出的训练升帧图像越来越逼近样本帧图。
通过反复训练直至插帧模型符合预设的收敛条件为止,本实施方式中,收敛条件为:迭代训练2000000次左右,且通过测试样本测试,模型输出 的准确率达到95%或者更高时,插帧模型就符合了收敛条件。但是,收敛条件的设定不局限于此,根据具体应用场景的不同,在一些实施方式中,迭代训练的次数,以及准确率的设定都能够根据实际需要进行设定。
S2130、确定训练至收敛状态的所述第一初始模型为所述运动向量网络模型,所述第二初始模型为所述帧合成网络模型。
当插帧模型确定训练至收敛状态后,此时,第一初始模型和第二初始模型也处于收敛状态,定义第一初始模型为运动向量网络模型,第二初始模型为帧合成网络模型。
本申请可以通过实现了前述的方法的各个实施例的应用程序在计算机中的运行来构造一个相应的装置,具体请参阅图7,图7为本实施例视频插帧装置基本结构示意图。
如图7所示,一种视频插帧装置,包括:获取模块2100、提取模块2200、处理模块2300和读取模块2400。其中,获取模块2100用于获取待处理的目标视频;提取模块2200用于提取所述目标视频中的第一参考帧图和第二参考帧图,其中,所述第一参考帧图与所述第二参考帧图在时间轴上相邻;处理模块2300用于将所述第一参考帧图和第二参考帧图合并输入至预设的插帧模型中,其中,所述插帧模型为预先训练至收敛,用于根据所述第一参考帧图和第二参考帧图之间的运动向量对所述目标帧图,进行插帧处理的神经网络模型;读取模块2400用于读取所述插帧模型输出的升帧图像,并将所述升帧图像插入至所述第一参考帧图和第二参考帧图之间。
视频插帧装置当确定需要进行插帧处理的目标视频时,读取目标视频中两个相邻的帧图像作为参考帧图,通过两张参考帧图提取两张帧图之间的运动向量,由于,运动向量能够表征两张参考帧图之间的过渡运动状态,因此,通过运动向量以及两张参考帧图插帧模型就能够生成,介于两张参考图之间的升帧图像。运动向量的引入,能够使升帧图像的图像画面显示两张参考帧图之间的中间状态,使插帧结果更加自然,极大的提高了用户体验。
在一些实施方式中,所述图像标准为帧率阈值,视频插帧装置还包括: 第一获取子模块、第一比对子模块和第一执行子模块。其中,第一获取子模块用于获取待播放视频的帧率数据;第一比对子模块用于将所述帧率数据与所述帧率阈值进行比对;第一执行子模块用于当所述帧率数据表征的帧率值小于所述帧率阈值,确定所述待播放视频为所述目标视频。
在一些实施方式中,所述插帧模型包括运动向量网络模型,视频插帧装置还包括:第一输入子模块、第一生成子模块、第二生成子模块和第一叠加子模块。其中,第一输入子模块用于将所述第一参考帧图和第二参考帧图进行通道图像叠加后生成叠加图像,并将所述叠加图像输入至所述运动向量网络模型中,其中,所述运动向量网络模型为预先训练至收敛,用于提取图像之间运动向量的卷积神经网络模型;第一生成子模块用于所述运动向量网络模型通过卷积层对所述叠加图像进行卷积池化处理,生成下采样特征;第二生成子模块用于所述运动向量网络模型通过反卷积层对所述下采样特征进行插值处理,生成上采样特征;第一叠加子模块用于将所述下采样特征和所述上采样特征进行特征融合叠加生成所述运动向量。
在一些实施方式中,所述插帧模型包括帧合成网络模型,视频插帧装置还包括:第三生成子模块、第二输入子模块、第四生成子模块和第五生成子模块。其中,第三生成子模块用于对所述运动向量进行插值处理生成中间帧图;第二输入子模块用于将所述运动向量、中间帧图、第一参考帧图和第二参考帧图合并输入至所述帧合成网络模型中,其中,所述帧合成网络模型为预先训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型;第四生成子模块用于所述帧合成网络模型对所述运动向量、中间帧图、第一参考帧图和第二参考帧图进行卷积处理,生成可见掩码图;第五生成子模块用于所述帧合成网络模型对所述可见掩码图和所述运动向量进行插值处理,生成所述升帧图像。
在一些实施方式中,视频插帧装置还包括:第一处理子模块、第三输入子模块、第一读取子模块、第四输入子模块、第二读取子模块和第二执行子模块。其中,第一处理子模块用于对预先采集的样本视频进行帧化处理生成样本图集,其中,所述样本图集包括:第一训练帧图、第二训练帧图和样本帧图,所述样本帧图位于所述第一训练帧图和第二训练帧图表征 的时间区间内;第三输入子模块用于将所述第一训练帧图和第二训练帧图输入至预设的第一初始模型中,其中,所述第一初始模型为尚未训练至收敛状态,用于提取图像之间运动向量的卷积神经网络模型;第一读取子模块用于读取所述第一初始模型输出的训练运动向量,并根据训练运动向量的插值生成训练中间帧图;第四输入子模块用于将所述第一训练帧图、第二训练帧图、训练运动向量和训练中间帧图输入至预设的第二初始模型中,其中,所述第二初始模型为尚未训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型;第二读取子模块用于读取所述第二初始模型输出的训练升帧图像,并根据预设的损失函数计算所述训练升帧图像与所述样本帧图之间的特征差值;第二执行子模块用于当所述特征差值大于预设的损失阈值,基于所述特征差值对所述第一初始模型和所述第二初始模型中的权重值进行反复迭代更新,直至所述特征差值小于等于所述损失阈值为止。
在一些实施方式中,视频插帧装置还包括:第二处理子模块和第三执行子模块。其中,第二处理子模块用于通过若干所述样本图集对所述插帧模型进行反复迭代的监督训练,直至所述插帧模型符合预设的收敛条件为止;第三执行子模块用于确定训练至收敛状态的所述第一初始模型为所述运动向量网络模型,所述第二初始模型为所述帧合成网络模型。
在一些实施方式中,视频插帧装置还包括:所述插帧模型包括损失函数,所述损失函数由重构差异函数和运动向量估计还原差异函数加权组成。
为解决上述技术问题,本申请实施例还提供一种计算机设备,用于运行根据所述视频插帧方法所实现的计算机程序。具体请参阅图8,图8为本实施例计算机设备基本结构框图。
如图8所示,计算机设备的内部结构示意图。该计算机设备包括通过系统总线连接的处理器、非易失性存储介质、存储器和网络接口。其中,该计算机设备的非易失性存储介质存储有操作系统、数据库和计算机可读指令,数据库中可存储有控件信息序列,该计算机可读指令被处理器执行时,可使得处理器实现一种视频插帧方法。该计算机设备的处理器用于提 供计算和控制能力,支撑整个计算机设备的运行。该计算机设备的存储器中可存储有计算机可读指令,该计算机可读指令被处理器执行时,可使得处理器执行一种视频插帧方法。该计算机设备的网络接口用于与终端连接通信。本领域技术人员可以理解,图8中示出的结构,仅仅是与本申请方案相关的部分结构的框图,并不构成对本申请方案所应用于其上的计算机设备的限定,具体的计算机设备可以包括比图中所示更多或更少的部件,或者组合某些部件,或者具有不同的部件布置。
本实施方式中处理器用于执行图7中获取模块2100、提取模块2200、处理模块2300和读取模块2400的具体功能,存储器存储有执行上述模块所需的程序代码和各类数据。网络接口用于向用户终端或服务器之间的数据传输。本实施方式中的存储器存储有视频插帧装置中执行所有子模块所需的程序代码及数据,服务器能够调用服务器的程序代码及数据执行所有子模块的功能。
计算机设备当确定需要进行插帧处理的目标视频时,读取目标视频中两个相邻的帧图像作为参考帧图,通过两张参考帧图提取两张帧图之间的运动向量,由于,运动向量能够表征两张参考帧图之间的过渡运动状态,因此,通过运动向量以及两张参考帧图插帧模型就能够生成,介于两张参考图之间的升帧图像。运动向量的引入,能够使升帧图像的图像画面显示两张参考帧图之间的中间状态,使插帧结果更加自然,极大的提高了用户体验。
本申请还提供一种非易失性存储介质,所述的视频插帧方法被编写成计算机程序,以计算机可读指令的形式存储于该存储介质中,计算机可读指令被一个或多个处理器执行时,意味着该程序在计算机中的运行,由此使得一个或多个处理器执行上述任一实施例视频插帧方法的步骤。
本领域普通技术人员可以理解实现上述实施例方法中的全部或部分流程,是可以通过计算机程序来指令相关的硬件来完成,该计算机程序可存储于一计算机可读取存储介质中,该程序在执行时,可包括如上述各方法的实施例的流程。其中,前述的存储介质可为磁碟、光盘、只读存储记忆体(Read-Only Memory,ROM)等非易失性存储介质,或随机存储记忆体 (Random Access Memory,RAM)等。
应该理解的是,虽然附图的流程图中的各个步骤按照箭头的指示依次显示,但是这些步骤并不是必然按照箭头指示的顺序依次执行。除非本文中有明确的说明,这些步骤的执行并没有严格的顺序限制,其可以以其他的顺序执行。而且,附图的流程图中的至少一部分步骤可以包括多个子步骤或者多个阶段,这些子步骤或者阶段并不必然是在同一时刻执行完成,而是可以在不同的时刻执行,其执行顺序也不必然是依次进行,而是可以与其他步骤或者其他步骤的子步骤或者阶段的至少一部分轮流或者交替地执行。
本技术领域技术人员可以理解,本申请中已经讨论过的各种操作、方法、流程中的步骤、措施、方案可以被交替、更改、组合或删除。进一步地,具有本申请中已经讨论过的各种操作、方法、流程中的其他步骤、措施、方案也可以被交替、更改、重排、分解、组合或删除。进一步地,现有技术中的具有与本申请中公开的各种操作、方法、流程中的步骤、措施、方案也可以被交替、更改、重排、分解、组合或删除。
以上所述仅是本申请的部分实施方式,应当指出,对于本技术领域的普通技术人员来说,在不脱离本申请原理的前提下,还可以做出若干改进和润饰,这些改进和润饰也应视为本申请的保护范围。
Claims (10)
- 一种视频插帧方法,其特征在于,包括:获取待处理的目标视频;提取所述目标视频中的第一参考帧图和第二参考帧图,其中,所述第一参考帧图与所述第二参考帧图在时间轴上相邻;将所述第一参考帧图和第二参考帧图合并输入至预设的插帧模型中,其中,所述插帧模型为预先训练至收敛,用于根据所述第一参考帧图和第二参考帧图之间的运动向量对所述目标视频进行插帧处理的神经网络模型;读取所述插帧模型输出的升帧图像,并将所述升帧图像插入至所述第一参考帧图和第二参考帧图之间。
- 根据权利要求1所述的视频插帧方法,其特征在于,所述图像标准为帧率阈值,所述获取待处理的目标视频包括:获取待播放视频的帧率数据;将所述帧率数据与所述帧率阈值进行比对;当所述帧率数据表征的帧率值小于所述帧率阈值,确定所述待播放视频为所述目标视频。
- 根据权利要求1所述的视频插帧方法,其特征在于,所述插帧模型包括运动向量网络模型,所述将所述第一参考帧图和第二参考帧图合并输入至预设的插帧模型中包括:将所述第一参考帧图和第二参考帧图进行通道图像叠加后生成叠加图像,并将所述叠加图像输入至所述运动向量网络模型中,其中,所述运动向量网络模型为预先训练至收敛,用于提取图像之间运动向量的卷积神经网络模型;所述运动向量网络模型通过卷积层对所述叠加图像进行卷积池化处理,生成下采样特征;所述运动向量网络模型通过反卷积层对所述下采样特征进行插值处理,生成上采样特征;将所述下采样特征和所述上采样特征进行特征融合叠加生成所述运动 向量。
- 根据权利要求3所述的视频插帧方法,其特征在于,所述插帧模型包括帧合成网络模型,所述将所述第一参考帧图和第二参考帧图合并输入至预设的插帧模型中包括:对所述运动向量进行插值处理生成中间帧图;将所述运动向量、中间帧图、第一参考帧图和第二参考帧图合并输入至所述帧合成网络模型中,其中,所述帧合成网络模型为预先训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型;所述帧合成网络模型对所述运动向量、中间帧图、第一参考帧图和第二参考帧图进行卷积处理,生成可见掩码图;所述帧合成网络模型对所述可见掩码图和所述运动向量进行插值处理,生成所述升帧图像。
- 根据权利要求4所述的视频插帧方法,其特征在于,所述插帧模型的训练方法包括:对预先采集的样本视频进行帧化处理生成样本图集,其中,所述样本图集包括:第一训练帧图、第二训练帧图和样本帧图,所述样本帧图位于所述第一训练帧图和第二训练帧图表征的时间区间内;将所述第一训练帧图和第二训练帧图输入至预设的第一初始模型中,其中,所述第一初始模型为尚未训练至收敛状态,用于提取图像之间运动向量的卷积神经网络模型;读取所述第一初始模型输出的训练运动向量,并根据训练运动向量的插值生成训练中间帧图;将所述第一训练帧图、第二训练帧图、训练运动向量和训练中间帧图输入至预设的第二初始模型中,其中,所述第二初始模型为尚未训练至收敛状态,用于对图像进行插值处理的卷积神经网络模型;读取所述第二初始模型输出的训练升帧图像,并根据预设的损失函数计算所述训练升帧图像与所述样本帧图之间的特征差值;当所述特征差值大于预设的损失阈值,基于所述特征差值对所述第一初始模型和所述第二初始模型中的权重值进行反复迭代更新,直至所述特 征差值小于等于所述损失阈值为止。
- 根据权利要求5所述的视频插帧方法,其特征在于,所述基于所述特征差值对所述第一初始模型和所述第二初始模型中的权重值进行反复迭代更新,直至所述特征差值小于等于所述损失阈值为止之后包括:通过若干所述样本图集对所述插帧模型进行反复迭代的监督训练,直至所述插帧模型符合预设的收敛条件为止;确定训练至收敛状态的所述第一初始模型为所述运动向量网络模型,所述第二初始模型为所述帧合成网络模型。
- 根据权利要求1-6任意一项所述的视频插帧方法,其特征在于,所述插帧模型包括损失函数,所述损失函数由重构差异函数和运动向量估计还原差异函数加权组成。
- 一种视频插帧装置,其特征在于,包括:获取模块,用于获取待处理的目标视频;提取模块,用于提取所述目标视频中的第一参考帧图和第二参考帧图,其中,所述第一参考帧图与所述第二参考帧图在时间轴上相邻;处理模块,用于将所述第一参考帧图和第二参考帧图合并输入至预设的插帧模型中,其中,所述插帧模型为预先训练至收敛,用于根据所述第一参考帧图和第二参考帧图之间的运动向量对所述目标帧图,进行插帧处理的神经网络模型;读取模块,用于读取所述插帧模型输出的升帧图像,并将所述升帧图像插入至所述第一参考帧图和第二参考帧图之间。
- 一种计算机设备,其特征在于,包括存储器和处理器,所述存储器中存储有计算机可读指令,所述计算机可读指令被所述处理器执行时,使得所述处理器执行如权利要求1至7中任一项权利要求所述视频插帧方法的步骤。
- 一种存储有计算机可读指令的存储介质,其特征在于,所述计算机可读指令被一个或多个处理器执行时,使得一个或多个处理器执行如权利要求1至7中任一项权利要求所述视频插帧方法的步骤。
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011603134.4 | 2020-12-29 | ||
CN202011603134.4A CN112804561A (zh) | 2020-12-29 | 2020-12-29 | 视频插帧方法、装置、计算机设备及存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022141819A1 true WO2022141819A1 (zh) | 2022-07-07 |
Family
ID=75804226
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2021/081990 WO2022141819A1 (zh) | 2020-12-29 | 2021-03-22 | 视频插帧方法、装置、计算机设备及存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN112804561A (zh) |
WO (1) | WO2022141819A1 (zh) |
Families Citing this family (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113596556B (zh) * | 2021-07-02 | 2023-07-21 | 咪咕互动娱乐有限公司 | 视频传输方法、服务器及存储介质 |
CN114007135B (zh) * | 2021-10-29 | 2023-04-18 | 广州华多网络科技有限公司 | 视频插帧方法及其装置、设备、介质、产品 |
CN114205648B (zh) * | 2021-12-07 | 2024-06-04 | 网易(杭州)网络有限公司 | 插帧方法及装置 |
CN114220175B (zh) * | 2021-12-17 | 2023-04-25 | 广州津虹网络传媒有限公司 | 运动模式识别方法及其装置、设备、介质、产品 |
CN115115964A (zh) * | 2022-01-18 | 2022-09-27 | 长城汽车股份有限公司 | 车载视频稳像方法、装置、车辆及存储介质 |
CN114125403B (zh) * | 2022-01-24 | 2022-06-03 | 广东欧谱曼迪科技有限公司 | 内窥镜显示方法、装置、电子设备及fpga |
CN115348437B (zh) * | 2022-07-29 | 2023-10-31 | 泽景(西安)汽车电子有限责任公司 | 视频处理方法、装置、设备及存储介质 |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9185339B2 (en) * | 2008-10-24 | 2015-11-10 | Hewlett-Packard Development Company, L.P. | Method and system for increasing frame-display rate |
CN108322685B (zh) * | 2018-01-12 | 2020-09-25 | 广州华多网络科技有限公司 | 视频插帧方法、存储介质以及终端 |
-
2020
- 2020-12-29 CN CN202011603134.4A patent/CN112804561A/zh active Pending
-
2021
- 2021-03-22 WO PCT/CN2021/081990 patent/WO2022141819A1/zh active Application Filing
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10382785B2 (en) * | 2011-01-05 | 2019-08-13 | Divx, Llc | Systems and methods of encoding trick play streams for use in adaptive streaming |
EP2890131A1 (en) * | 2013-12-31 | 2015-07-01 | Patents Factory Ltd. Sp. z o.o. | Video coding with different spatial resolutions for intra-coded frames and inter-coded frames |
CN105517671A (zh) * | 2015-05-25 | 2016-04-20 | 北京大学深圳研究生院 | 一种基于光流法的视频插帧方法及系统 |
CN110070067A (zh) * | 2019-04-29 | 2019-07-30 | 北京金山云网络技术有限公司 | 视频分类方法及其模型的训练方法、装置和电子设备 |
CN110324664A (zh) * | 2019-07-11 | 2019-10-11 | 南开大学 | 一种基于神经网络的视频补帧方法及其模型的训练方法 |
CN112040311A (zh) * | 2020-07-24 | 2020-12-04 | 北京航空航天大学 | 视频图像补帧方法、装置、设备及可存储介质 |
CN111898701A (zh) * | 2020-08-13 | 2020-11-06 | 网易(杭州)网络有限公司 | 模型训练、帧图像生成、插帧方法、装置、设备及介质 |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115426525A (zh) * | 2022-09-05 | 2022-12-02 | 北京拙河科技有限公司 | 一种基于高速动帧联动图像拆分方法及装置 |
CN115426525B (zh) * | 2022-09-05 | 2023-05-26 | 北京拙河科技有限公司 | 一种基于高速动帧联动图像拆分方法及装置 |
CN115644804A (zh) * | 2022-09-29 | 2023-01-31 | 浙江浙大西投脑机智能科技有限公司 | 一种基于钙成像恢复算法的双光子成像方法及系统 |
CN115644804B (zh) * | 2022-09-29 | 2023-08-18 | 浙江浙大西投脑机智能科技有限公司 | 一种基于钙成像恢复算法的双光子成像方法及系统 |
CN115866332A (zh) * | 2022-11-28 | 2023-03-28 | 江汉大学 | 一种视频帧插帧模型的处理方法、装置以及处理设备 |
CN115883869A (zh) * | 2022-11-28 | 2023-03-31 | 江汉大学 | 基于Swin Transformer的视频帧插帧模型的处理方法、装置及处理设备 |
CN115883869B (zh) * | 2022-11-28 | 2024-04-19 | 江汉大学 | 基于Swin Transformer的视频帧插帧模型的处理方法、装置及处理设备 |
CN115866332B (zh) * | 2022-11-28 | 2024-04-19 | 江汉大学 | 一种视频帧插帧模型的处理方法、装置以及处理设备 |
CN117893579A (zh) * | 2024-01-23 | 2024-04-16 | 华院计算技术(上海)股份有限公司 | 人物插帧图像生成方法及装置、计算机可读存储介质、终端 |
Also Published As
Publication number | Publication date |
---|---|
CN112804561A (zh) | 2021-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2022141819A1 (zh) | 视频插帧方法、装置、计算机设备及存储介质 | |
CN110324664B (zh) | 一种基于神经网络的视频补帧方法及其模型的训练方法 | |
WO2022110638A1 (zh) | 人像修复方法、装置、电子设备、存储介质和程序产品 | |
CN111028150B (zh) | 一种快速时空残差注意力视频超分辨率重建方法 | |
US11928753B2 (en) | High fidelity interactive segmentation for video data with deep convolutional tessellations and context aware skip connections | |
US10354394B2 (en) | Dynamic adjustment of frame rate conversion settings | |
WO2023005140A1 (zh) | 视频数据处理方法、装置、设备以及存储介质 | |
CN114007135B (zh) | 视频插帧方法及其装置、设备、介质、产品 | |
KR101702925B1 (ko) | 복수의 영상 프레임의 이미지 패치를 이용한 해상도 스케일링 장치 및 그 방법 | |
CN107920202A (zh) | 基于增强现实的视频处理方法、装置及电子设备 | |
CN112565887B (zh) | 一种视频处理方法、装置、终端及存储介质 | |
CN115115516A (zh) | 基于Raw域的真实世界视频超分辨率算法 | |
CN113902611A (zh) | 图像美颜处理方法、装置、存储介质与电子设备 | |
CN112200817A (zh) | 基于图像的天空区域分割和特效处理方法、装置及设备 | |
CN115294055A (zh) | 图像处理方法、装置、电子设备和可读存储介质 | |
CN114565532A (zh) | 视频美颜处理方法、装置、存储介质与电子设备 | |
US20240205376A1 (en) | Image processing method and apparatus, computer device, and storage medium | |
Zhao et al. | SVCNet: Scribble-based video colorization network with temporal aggregation | |
CN117768774A (zh) | 图像处理器、图像处理方法、拍摄装置和电子设备 | |
WO2021179954A1 (zh) | 视频处理方法、装置、设备及存储介质 | |
WO2024032331A9 (zh) | 图像处理方法及装置、电子设备、存储介质 | |
CN116033183A (zh) | 视频插帧方法及装置 | |
CN115049559A (zh) | 模型训练、人脸图像处理、人脸模型处理方法及装置、电子设备及可读存储介质 | |
CN115049558A (zh) | 模型训练、人脸图像处理方法及装置、电子设备及可读存储介质 | |
WO2022120809A1 (zh) | 虚拟视点绘制、渲染、解码方法及装置、设备、存储介质 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 21912589; Country of ref document: EP; Kind code of ref document: A1
 | NENP | Non-entry into the national phase | Ref country code: DE
 | 122 | Ep: pct application non-entry in european phase | Ref document number: 21912589; Country of ref document: EP; Kind code of ref document: A1