CN110996171B - Training data generation method and device for video tasks and server


Info

Publication number
CN110996171B
CN110996171B (application CN201911280328.2A)
Authority
CN
China
Prior art keywords
video
video frame
image
frame sequence
training data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911280328.2A
Other languages
Chinese (zh)
Other versions
CN110996171A (en)
Inventor
鲁方波 (Lu Fangbo)
汪贤 (Wang Xian)
樊鸿飞 (Fan Hongfei)
蔡媛 (Cai Yuan)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Cloud Network Technology Co Ltd
Original Assignee
Beijing Kingsoft Cloud Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Cloud Network Technology Co Ltd
Priority to CN201911280328.2A
Publication of CN110996171A
Application granted
Publication of CN110996171B
Active legal status (current)
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83 Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845 Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456 Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/4402 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving reformatting operations of video signals for household redistribution, storage or real-time display

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Television Systems (AREA)

Abstract

The invention provides a method, an apparatus, and a server for generating training data for video tasks. The method comprises the following steps: performing frame interpolation processing based on a preset image group to obtain a video frame sequence, wherein the image group comprises a plurality of images; performing degradation processing on the video frame sequence to reduce its definition, obtaining a degraded video frame sequence; and generating training data for a preset video task from the video frame sequence and the degraded video frame sequence. This way of generating training data can produce a large amount of training data in a short time; it is convenient to operate, easy to implement, and low in cost.

Description

Training data generation method and device for video tasks and server
Technical Field
The invention relates to the technical field of image processing, and in particular to a method, an apparatus, and a server for generating training data for video tasks.
Background
Video restoration tasks mainly include video denoising, video super-resolution, video deblurring, and the like; these tasks can be realized by network models obtained through deep-learning training. Training a network model requires a large amount of training data, but directly usable video training data is scarce, and collecting it manually consumes considerable manpower and material resources. For this reason, the related art captures video data with specific image acquisition equipment to obtain a large amount of training data; however, this approach is easily affected by the acquisition environment, makes the acquisition cost of training data high, and makes collection difficult.
Disclosure of Invention
The invention aims to provide a method, an apparatus, and a server for generating training data for video tasks that can generate a large amount of training data in a short time and reduce the cost.
In a first aspect, an embodiment of the present invention provides a method for generating training data of a video task, the method comprising: performing frame interpolation processing based on a preset image group to obtain a video frame sequence, wherein the image group comprises a plurality of images; performing degradation processing on the video frame sequence to reduce its definition, obtaining a degraded video frame sequence; and generating training data of a preset video task through the video frame sequence and the degraded video frame sequence.
With reference to the first aspect, an embodiment of the present invention provides a first possible implementation manner of the first aspect, wherein the image group includes an original image and a transformed image obtained by transforming the original image, and the step of performing frame interpolation processing based on a preset image group to obtain a video frame sequence comprises: performing frame interpolation processing between the original image and the transformed image through a pre-trained video frame interpolation model to obtain a video frame sequence, wherein the image content of at least a part of the video frames in the sequence changes continuously.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a second possible implementation manner of the first aspect, wherein the image group includes multiple kinds of transformed images, each kind obtained by transforming the original image through a transformation operation corresponding to that kind, and the step of performing frame interpolation processing between the original image and the transformed image comprises: for each kind of transformed image, performing frame interpolation processing between the original image and that transformed image to obtain a video subsequence corresponding to it; and splicing the video subsequences corresponding to all the transformed images to obtain a video frame sequence.
With reference to the first possible implementation manner of the first aspect, an embodiment of the present invention provides a third possible implementation manner of the first aspect, wherein the transformed images in the image group include a first transformed image obtained by transforming the original image through a preset transformation operation and a second transformed image obtained through the inverse operation corresponding to that transformation operation, and the step of performing frame interpolation processing between the original image and the transformed image comprises: performing frame interpolation between the first transformed image and the original image and between the original image and the second transformed image respectively to obtain a video frame sequence.
With reference to any one of the first aspect to the third possible implementation manner of the first aspect, an embodiment of the present invention provides a fourth possible implementation manner of the first aspect, wherein the image group includes multiple groups; before the step of performing frame interpolation processing based on a preset image group to obtain a video frame sequence, the method further includes: scaling the images in each image group to a preset size; and after that step, the method further includes: splicing the video frame sequences corresponding to each image group to obtain a spliced video frame sequence.
With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation manner of the first aspect, where the degradation processing includes: one or more of encoding, blurring, and adding noise.
With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation manner of the first aspect, wherein the step of generating training data of a preset video task through the video frame sequence and the degraded video frame sequence comprises: determining target frame positions according to a preset order; acquiring the video frame segments corresponding to a target frame position from the video frame sequence and from the degraded video frame sequence respectively, and taking the acquired video frame segments as a training data pair; and determining the obtained training data pairs as the training data of the preset video task.
In a second aspect, an embodiment of the present invention further provides an apparatus for generating training data of a video task, the apparatus comprising: a frame interpolation module, configured to perform frame interpolation processing based on a preset image group to obtain a video frame sequence, wherein the image group comprises a plurality of images; a degradation module, configured to perform degradation processing on the video frame sequence to reduce its definition and obtain a degraded video frame sequence; and a training data generation module, configured to generate training data of a preset video task through the video frame sequence and the degraded video frame sequence.
With reference to the second aspect, an embodiment of the present invention provides a first possible implementation manner of the second aspect, wherein the image group includes an original image and a transformed image obtained by transforming the original image, and the frame interpolation module is further configured to: perform frame interpolation processing between the original image and the transformed image through a pre-trained video frame interpolation model to obtain a video frame sequence, wherein the image content of at least a part of the video frames in the sequence changes continuously.
With reference to the second aspect, an embodiment of the present invention provides a second possible implementation manner of the second aspect, wherein the training data generation module is further configured to: determine target frame positions according to a preset order; acquire the video frame segments corresponding to a target frame position from the video frame sequence and from the degraded video frame sequence respectively, and take the acquired video frame segments as a training data pair; and determine the obtained training data pairs as the training data of the preset video task.
In a third aspect, an embodiment of the present invention further provides a server that includes a processor and a memory, the memory storing machine-executable instructions that can be executed by the processor; the processor executes the machine-executable instructions to implement the above method for generating training data of video tasks.
In a fourth aspect, an embodiment of the present invention further provides a machine-readable storage medium storing machine-executable instructions that, when called and executed by a processor, cause the processor to implement the above method for generating training data of video tasks.
According to the method, apparatus, and server for generating training data of video tasks described above, frame interpolation processing is first performed based on a preset image group to obtain a video frame sequence; degradation processing is then performed on the video frame sequence to reduce its definition and obtain a degraded video frame sequence; and training data of a preset video task is generated through the video frame sequence and the degraded video frame sequence. This approach can generate a large amount of training data in a short time, is convenient to operate, easy to implement, and low in cost.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a training data generation method for video tasks according to an embodiment of the present invention;
fig. 2 is a flowchart of another training data generation method for video tasks according to an embodiment of the present invention;
fig. 3 is a flowchart of yet another training data generation method for video tasks according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a training data generation apparatus for video tasks according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of a server according to an embodiment of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments. Obviously, the described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without creative effort shall fall within the protection scope of the present invention.
For ease of understanding, video tasks are described first. With the development of computer technology and liquid-crystal display technology, high-definition display equipment is now widely used in everyday life, and this improvement in hardware has increased the demand for high-definition video sources. However, most existing stock videos are low-definition, low-resolution videos shot years or even decades ago; in addition, owing to limitations of the shooting environment, a large amount of freshly shot video also fails to meet high-definition or even ultra-high-definition requirements. Therefore, properly restoring and processing videos that do not meet definition requirements is currently the main way of dealing with low-definition video. Video restoration includes, but is not limited to, operations such as video denoising, video super-resolution, and video deblurring; the process of carrying out these restoration operations is referred to here as a video task, and may also be called a video restoration task.
With the development of computer vision, deep learning is widely applied in various fields, and deep-learning algorithms can be adopted for video restoration tasks, generally with good results. Before processing video with a deep-learning algorithm, a large amount of training data usually needs to be collected in advance; an initial model built on the algorithm is trained with this data, and the trained model can then be applied to actual video tasks. However, directly usable video training samples are scarce, and collecting a large amount of video training data usually requires a lot of labor and time.
In the related art, specific image acquisition equipment can be used to capture data and obtain a large number of training samples. This approach is costly and generally limited by the acquisition environment; if the requirements of the actual video task change, new training data must be reacquired. For image tasks, training data can be acquired by copying a high-definition image multiple times, applying a random transformation to each copy to obtain an image sequence, and then encoding and decoding the sequence to obtain training data pairs; for video tasks, however, no stable, low-cost data acquisition method currently exists.
Based on this, an embodiment of the present invention provides a method for generating training data of video tasks. The technique can be applied to the training data acquisition process of any video task model built on a deep-learning algorithm, such as a video denoising task, a video super-resolution task, or a video deblurring task.
First, referring to the flowchart of a training data generation method for video tasks shown in fig. 1, the method includes the following steps:
step S100, performing frame interpolation processing based on a preset image group to obtain a video frame sequence; wherein the image group includes a plurality of images.
The image group generally includes a plurality of images; to ensure the definition of the generated video frame sequence, the images in the group may be high-definition images. When performing frame interpolation, any two images in the image group can be selected as target reference frames, and frames are inserted between them in a preset frame interpolation manner. To ensure the definition and fluency of the generated video frame sequence, an inserted frame may in turn be used as a reference frame, and frame insertion continues until the sequence composed of the target reference frames and the frames inserted between them meets a preset fluency requirement.
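The iterative insertion described here can be sketched as follows; this is a minimal illustration rather than the patented implementation, and the `interpolate` placeholder plus the fixed recursion depth stand in for whichever interpolation method and fluency check are actually chosen:

```python
import numpy as np

def interpolate(frame_a: np.ndarray, frame_b: np.ndarray) -> np.ndarray:
    # Placeholder for the chosen frame interpolation method (e.g. a deep
    # network); a pixel average stands in for the predicted middle frame.
    mid = (frame_a.astype(np.float32) + frame_b.astype(np.float32)) / 2
    return mid.astype(frame_a.dtype)

def insert_frames(frame_a, frame_b, depth):
    # Recursively insert a middle frame, then treat the inserted frame as a
    # new reference frame on each side; depth rounds yield 2**depth - 1
    # inserted frames between the two target reference frames.
    if depth == 0:
        return []
    mid = interpolate(frame_a, frame_b)
    return (insert_frames(frame_a, mid, depth - 1)
            + [mid]
            + insert_frames(mid, frame_b, depth - 1))

def build_sequence(frame_a, frame_b, depth=3):
    # The video frame sequence: both target reference frames plus all the
    # frames inserted between them.
    return [frame_a] + insert_frames(frame_a, frame_b, depth) + [frame_b]
```

In practice the recursion would stop once the sequence composed of the target reference frames and the inserted frames meets the preset fluency requirement, rather than at a fixed depth.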
Current video frame interpolation approaches mainly fall into three types: interpolation based on image interpolation, interpolation based on optical-flow estimation, and interpolation based on deep learning. Each has advantages and disadvantages, and a suitable approach can be selected for the frame insertion process. Owing to the flexibility of deep learning, deep-learning-based frame interpolation has the most room for development and is most likely to satisfy the various requirements of video frame interpolation, so it can usually be chosen to realize the frame insertion process.
Apart from transition scenes, video data collected in the real world usually shows a scene being transformed, such as moving, rotating, or scaling (a transformation process is also involved when changing from one scene to another). To come closer to real video, the image group may therefore include some high-definition original images together with transformed images obtained from them. Performing frame interpolation between a high-definition original image and its transformed image yields a gradually changing scene in the video frame sequence; after video frame sequences are spliced, if two adjacent images were not obtained through frame interpolation and have no transformation relationship between them, the video can be regarded as simulating a transition scene.
Step S102, performing degradation processing on the video frame sequence to reduce the definition of the video frame sequence, and obtaining the degraded video frame sequence.
The degradation processing corresponds to the video task. Because of its high definition, the video frame sequence obtained through frame interpolation is equivalent to the expected output data of an initial model built on deep learning for the video task. When the video task is video denoising, the corresponding degradation processing may be adding noise to the video frame sequence, giving a noisier, lower-definition sequence that is equivalent to the input data of the initial model. The degradation processing may also combine several operations, such as encoding the video and compressing it to a lower resolution, then adding noise to the compressed video; the corresponding video task would then be a mixed task of video super-resolution and video denoising.
Step S104, generating training data of the preset video task through the video frame sequence and the degraded video frame sequence.
The preset video task can be a video denoising task, a video super-resolution task, a video deblurring task, or the like, or a related mixed task; the degradation processing is set according to this preset video task. The training data of the preset video task is obtained by selecting a video frame sequence before degradation and the corresponding sequence after degradation as a set of training data. When the image group includes a high-definition original image and a transformed image obtained from it, a segment of video frame data obtained by frame interpolation between the two, together with the corresponding degraded video frame sequence, can be used as a group of training data for the initial model of the video task; alternatively, some continuous video frames in the video frame data and the corresponding degraded video frames can be selected as training data, with the specific frames chosen as needed.
In the above method for generating training data of video tasks, frame interpolation processing is first performed based on a preset image group to obtain a video frame sequence; the sequence is then degraded to reduce its definition, giving a degraded video frame sequence; and training data of the preset video task is generated through the video frame sequence and the degraded video frame sequence. The method can generate a large amount of training data in a short time, is convenient to operate, easy to implement, and low in cost.
An embodiment of the present invention further provides another method for generating training data of video tasks, realized on the basis of the method of the above embodiment. It mainly describes the specific implementation of performing frame interpolation processing based on a preset image group to obtain a video frame sequence when the image group includes an original image and a transformed image obtained from it (implemented by step S200 below). As shown in fig. 2, the method includes the following steps:
s200, performing frame interpolation between an original image and a transformed image through a video frame interpolation model obtained through pre-training to obtain a video frame sequence; wherein, in the video frame sequence, the image content of at least a part of the video frames changes continuously.
The video frame interpolation model can be obtained by training an initial model built on deep learning. The transformed image can be obtained by transforming the original image through one or more of scaling, rotation, translation, and similar transformations, and is typically the same size as the original image. To obtain the video frame sequence, the original image and the transformed image are taken as the forward and backward reference frames respectively; frame insertion is performed between them through the pre-trained video frame interpolation model to obtain a first inserted frame, and the original image, the first inserted frame, and the transformed image are taken as the initial video frame sequence. Frame interpolation then continues on this initial sequence until a video frame sequence is obtained whose first video frame is the original image and whose last video frame is the transformed image. The image content in this sequence changes continuously, so the process of the original image transforming into the transformed image can be shown in the form of a video.
Step S202, performing degradation processing on the video frame sequence to reduce its definition and obtain a degraded video frame sequence. Specifically, the degradation processing may include one or more of encoding, blurring, and adding noise: encoding the video frame sequence compresses the video and lowers its resolution; blurring the sequence reduces the definition of the video; and adding noise to the sequence likewise reduces its definition.
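As one possible reading of these degradation operations, here is a sketch using OpenCV and NumPy; the kernel size, noise level, and scale factor are illustrative values, not taken from the patent:

```python
import cv2
import numpy as np

def degrade_frame(frame, blur_ksize=5, noise_sigma=10.0, scale=0.5):
    # Blurring reduces definition (relevant to deblurring tasks).
    out = cv2.GaussianBlur(frame, (blur_ksize, blur_ksize), 0)
    # Down- then up-scaling simulates resolution loss (super-resolution tasks).
    h, w = out.shape[:2]
    out = cv2.resize(out, (int(w * scale), int(h * scale)))
    out = cv2.resize(out, (w, h))
    # Additive Gaussian noise (denoising tasks); assumes 8-bit frames.
    noise = np.random.normal(0.0, noise_sigma, out.shape)
    return np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def degrade_sequence(frames):
    # Apply the same degradation to every frame of the video frame sequence.
    return [degrade_frame(f) for f in frames]
```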
Step S204, determining target frame positions according to a preset order. The preset order can be determined by the number of video frames (or the video length) required by the video task during training. For example, if the video frame sequence obtained by frame interpolation contains 100 video frames and each training pass requires 25 frames, the preset order can be set to the first 25 frames, frames 26 to 50, frames 51 to 75, and frames 76 to 100, so that the positions of frames 1, 26, 51, and 76 in the video frame sequence are used in turn as the target frame positions.
Step S206, acquiring the video frame segments corresponding to a target frame position from the video frame sequence and from the degraded video frame sequence respectively, and taking the acquired video frame segments as a training data pair. Specifically, a set number of video frames can be selected starting from the target frame position in each sequence as a video frame segment; under the above assumption, when the target frame position is frame 26, frames 26 to 50 of the undegraded video frame sequence and frames 26 to 50 of the degraded video frame sequence are selected as a training data pair.
Step S208, determining the obtained training data pairs as the training data of the preset video task. Each training data pair comprises an undegraded video frame segment and the corresponding degraded segment; the degradation applied to the segment is based on the type of the preset video task, and the length of the segment meets the task's requirement on the number of video frames, so the training data pairs can be determined as the training data of the preset video task.
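A sketch of steps S204 to S208, using the 25-frame example above (the helper name is hypothetical):

```python
def make_training_pairs(clean_frames, degraded_frames, segment_len=25):
    # Target frame positions follow the preset order: one position every
    # segment_len frames (frames 1, 26, 51, 76, ... in 1-based numbering).
    assert len(clean_frames) == len(degraded_frames)
    pairs = []
    for start in range(0, len(clean_frames) - segment_len + 1, segment_len):
        clean_segment = clean_frames[start:start + segment_len]
        degraded_segment = degraded_frames[start:start + segment_len]
        pairs.append((clean_segment, degraded_segment))  # one training data pair
    return pairs

# With a 100-frame sequence this yields four training data pairs, matching
# the example above.
```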
In this method, frame interpolation is first performed between the original image and the transformed image through a pre-trained video frame interpolation model to obtain a video frame sequence, and the sequence is degraded to reduce its definition; target frame positions are then determined according to a preset order, the video frame segments corresponding to each position are acquired from the video frame sequence and the degraded sequence respectively, and the acquired segments are taken as training data pairs, yielding the training data of the preset video task. Generating the video frame sequence from the original and transformed images, degrading it, and then extracting corresponding video frame segments as training data improves the generation efficiency of training data for video tasks at low cost.
An embodiment of the present invention further provides yet another method for generating training data of video tasks, realized on the basis of the method of the above embodiment. It mainly describes the specific implementation of performing frame interpolation between an original image and transformed images to obtain a video frame sequence (implemented by steps S302 and S304 below) when the image group includes multiple groups and the transformed images in each group include a first transformed image obtained through a preset transformation operation and a second transformed image obtained through the corresponding inverse operation. As shown in fig. 3, the method includes the following steps:
step S300, zooming the images in each group of image groups to a preset size; when the image group includes multiple groups, in order to enable smooth switching between two video frame sequences after subsequent video frame sequence splicing, images in each group of image group need to be scaled to a preset size, so that sizes of video frames in the generated video frame sequences are the same.
Step S302, for each image group, performing frame interpolation between the first transformed image and the original image and between the original image and the second transformed image respectively, through a pre-trained video frame interpolation model, to obtain a video frame sequence.
The frame insertion between the first transformed image and the original image is similar to that between the original image and the second transformed image: several reference frames are inserted between the two images, so that the sequence shows the first transformed image changing into the original image through the inverse of the preset transformation operation, and the original image then changing into the second transformed image through that same inverse operation. For example, if the first transformed image is obtained by rotating the original image 90 degrees clockwise, the second transformed image is obtained by rotating it 90 degrees counterclockwise; the video frame sequence obtained in step S302 then shows the first transformed image rotating 90 degrees counterclockwise into the original image, and the original image rotating a further 90 degrees counterclockwise into the second transformed image.
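The rotation example can be written out as follows; this sketch uses OpenCV, the file name is hypothetical, and square images are assumed so that rotation preserves the frame size:

```python
import cv2

original = cv2.imread("original.png")  # hypothetical input, scaled to a square size

# Preset transformation operation: rotate 90 degrees clockwise.
first_transformed = cv2.rotate(original, cv2.ROTATE_90_CLOCKWISE)
# Corresponding inverse operation: rotate 90 degrees counterclockwise.
second_transformed = cv2.rotate(original, cv2.ROTATE_90_COUNTERCLOCKWISE)
```

Interpolating from the first transformed image to the original image, and then from the original image to the second transformed image, yields a sequence whose content appears to rotate counterclockwise through 180 degrees in total.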
Step S304, splicing the video frame sequences corresponding to each image group to obtain a spliced video frame sequence.
During splicing, the video frame sequence generated from each image group can be spliced as a unit, or each sequence can be divided into several sub video frame sequences that are then spliced. The splicing may follow a random order or a preset order, such as the numbering order of the image groups; a sketch is given below.
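A sketch of these two splicing options (random order, or a preset order such as the numbering of the image groups); the helper is hypothetical:

```python
import random

def splice_sequences(sequences, order=None):
    # sequences: one video frame sequence (or sub video frame sequence) per
    # image group. If no preset order is given, splice in random order.
    if order is None:
        order = list(range(len(sequences)))
        random.shuffle(order)
    spliced = []
    for idx in order:
        spliced.extend(sequences[idx])
    return spliced
```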
Further, when the image group contains multiple kinds of transformed images, each kind obtained from the original image through the transformation operation corresponding to that kind, the frame interpolation between the original image and the transformed images to obtain a video frame sequence is similar to the process above and is specifically realized as follows:
(1) For each kind of transformed image, performing frame interpolation between the original image and that transformed image to obtain the video subsequence corresponding to that kind.
(2) Splicing the video subsequences corresponding to all the transformed images to obtain a video frame sequence; the splicing can likewise follow a random order or a preset splicing order.
Step S306, performing degradation processing on the video frame sequence to reduce the definition of the video frame sequence, so as to obtain the degraded video frame sequence.
Step S308, generating training data of the preset video task through the video frame sequence and the degraded video frame sequence.
In this way, when there are multiple image groups, the images in every group are first scaled to a preset size; for each group, frame interpolation is performed between the first transformed image and the original image and between the original image and the second transformed image through a pre-trained video frame interpolation model to obtain a video frame sequence; the sequence is degraded to reduce its definition; and training data of the preset video task is generated through the video frame sequence and the degraded sequence. Generating the video frame sequence from multiple image groups and then degrading it improves the generation efficiency of training data for video tasks and reduces the cost.
An embodiment of the present invention further provides another concrete method for generating video training data, realized on the basis of the methods of the above embodiments. The method first generates several image groups: a number of high-definition images are acquired and all scaled to a fixed size, and each image is subjected to a forward random transformation and an inverse random transformation, the transformed images and the original image together forming an image group. A video frame interpolation algorithm is then used to insert frames between each transformed image and the original image; the original image, the transformed images, and the inserted frames are merged into a video sequence, the video sequences generated from all the images are merged into one video, and this video is degraded with some algorithm. Finally, the same frame segments are randomly selected from the merged video and the degraded video to form the required training data pairs, giving the required video training data. The method is realized through the following steps:
(1) Acquiring a plurality of high-definition images and scaling all of them to a fixed size, for example 1024 pixels wide and 1024 pixels high.
(2) Performing a forward random transformation and an inverse random transformation respectively on each scaled high-definition image I0 obtained in step (1) (corresponding to the original image in the previous embodiments) to obtain an image I1 (corresponding to the first transformed image) and an image I2 (corresponding to the second transformed image). The random transformation includes, but is not limited to, image rotation, image translation, image scaling, and the like. The forward random transformation (corresponding to the preset transformation operation) and the inverse random transformation (corresponding to its inverse operation) are two opposite operations; for example, if the forward transformation is a clockwise rotation, the inverse is a counterclockwise rotation, and if the forward transformation is an image reduction, the inverse is an image enlargement.
(3) Taking images I1 and I0 as the left and right reference images respectively, and running n1 rounds of video frame interpolation prediction (equivalent to the frame interpolation processing in the previous embodiments) with a pre-trained video frame interpolation algorithm (such as SepConv) to obtain the inserted frame sequence INTER1; then taking images I0 and I2 as the left and right reference images and running n2 rounds of interpolation prediction to obtain the inserted frame sequence INTER2. I1, INTER1, I0, INTER2, and I2 together form a video sequence V0 (corresponding to the video frame sequence in the previous embodiments). If multiple random transformations are applied in step (2), several training data pairs can be generated from a single image.
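A sketch of steps (2) and (3); it reuses the `insert_frames` placeholder from the earlier sketch in place of a pre-trained interpolation network such as SepConv, uses rotation as one possible random transformation pair, and approximates the n1 and n2 prediction rounds with a recursion depth:

```python
import cv2

def random_transform_pair(i0):
    # Forward random transformation and its inverse; rotation is one option,
    # translation or scaling would work the same way.
    i1 = cv2.rotate(i0, cv2.ROTATE_90_CLOCKWISE)         # forward transform
    i2 = cv2.rotate(i0, cv2.ROTATE_90_COUNTERCLOCKWISE)  # inverse transform
    return i1, i2

def make_v0(i0, depth=3):
    # INTER1: frames predicted between I1 (left reference) and I0 (right
    # reference); INTER2: frames predicted between I0 and I2.
    i1, i2 = random_transform_pair(i0)
    inter1 = insert_frames(i1, i0, depth)
    inter2 = insert_frames(i0, i2, depth)
    # V0 = I1, INTER1, I0, INTER2, I2
    return [i1] + inter1 + [i0] + inter2 + [i2]
```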
(4) Transforming each of the m high-definition images with t random transformation modes and interpolating between each original image and its transformed images yields m × t (where × denotes multiplication) high-definition videos (equivalent to the video subsequences in the previous embodiments); the m × t high-definition videos are merged into one video V1 (corresponding to the splicing processing above). V1 is then randomly degraded by some algorithm to obtain a degraded video V2 (equivalent to the degraded video frame sequence above); the same frame segment is randomly selected from V1 and V2 to form a training data pair, and repeating the random selection yields multiple training data pairs. The algorithmic degradation includes, but is not limited to, H264 or H265 encoding, Gaussian blur, Gaussian noise, and scaling.
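Step (4) might look as follows; the helpers are hypothetical, the CRF value is illustrative, and the codec degradation assumes the ffmpeg command-line tool is installed:

```python
import random
import subprocess

def merge_videos(subsequences):
    # Merge the m*t high-definition subsequences into a single video V1.
    v1 = []
    for seq in subsequences:
        v1.extend(seq)
    return v1

def encode_degrade(src_path, dst_path, crf=35):
    # One possible codec degradation: re-encode with H.264 at a high CRF
    # (lower quality).
    subprocess.run(
        ["ffmpeg", "-y", "-i", src_path, "-c:v", "libx264",
         "-crf", str(crf), dst_path],
        check=True)

def sample_pair(v1, v2, segment_len=25):
    # Randomly select the same frame segment from the merged video V1 and
    # the degraded video V2 to form one training data pair.
    start = random.randint(0, len(v1) - segment_len)
    return v1[start:start + segment_len], v2[start:start + segment_len]
```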
The training data generation method for video tasks provided by this embodiment of the invention can quickly generate the required video training data from a number of high-definition images, and is simple, efficient, and low in cost.
Corresponding to the above embodiments of the training data generation method for video tasks, an embodiment of the present invention further provides a training data generation apparatus for video tasks, as shown in fig. 4. The apparatus includes:
the frame interpolation module 400 is configured to perform frame interpolation processing based on a preset image group to obtain a video frame sequence; wherein the image group comprises a plurality of images;
a degrading module 402, configured to perform degrading processing on the video frame sequence to reduce the definition of the video frame sequence, so as to obtain a degraded video frame sequence;
a training data generation module 404, configured to generate training data of a preset video task through the video frame sequence and the degraded video frame sequence.
This training data generation apparatus for video tasks performs frame interpolation processing based on a preset image group to obtain a video frame sequence, degrades the sequence to reduce its definition and obtain a degraded video frame sequence, and generates training data of a preset video task through the two sequences. The apparatus can generate a large amount of training data in a short time, is convenient to operate, easy to implement, and low in cost.
In an actual implementation process, the image group may include an original image and a transformed image transformed from the original image; the frame interpolation module is further configured to: performing frame interpolation processing between an original image and a transformed image through a video frame interpolation model obtained by pre-training to obtain a video frame sequence; wherein, in the video frame sequence, the image content of at least a part of the video frames changes continuously.
When the image group includes multiple kinds of transformed images, each kind obtained from the original image through the transformation operation corresponding to that kind, the frame interpolation module is further configured to: for each kind of transformed image, perform frame interpolation processing between the original image and that transformed image to obtain the video subsequence corresponding to it; and splice the video subsequences corresponding to all the transformed images to obtain a video frame sequence.
Further, when the transformed images in the image group include a first transformed image obtained by transforming the original image through a preset transformation operation and a second transformed image obtained through the corresponding inverse operation, the frame interpolation module is further configured to: perform frame interpolation between the first transformed image and the original image and between the original image and the second transformed image respectively to obtain a video frame sequence.
Further, when the image group comprises multiple groups, the apparatus further comprises an image scaling module for scaling the images in each image group to a preset size, and a video splicing module for splicing the video frame sequences corresponding to each image group to obtain a spliced video frame sequence.
Further, the degradation processing includes: one or more of encoding, blurring, and adding noise.
Further, the training data generation module is further configured to: determine target frame positions according to a preset order; acquire the video frame segments corresponding to a target frame position from the video frame sequence and from the degraded video frame sequence respectively, and take the acquired segments as a training data pair; and determine the obtained training data pairs as the training data of the preset video task.
The implementation principle and technical effects of the training data generation apparatus for video tasks provided by this embodiment are the same as those of the foregoing method embodiments; for brevity, where the apparatus embodiment is silent, reference may be made to the corresponding content of the method embodiments.
An embodiment of the present invention further provides a server, as shown in fig. 5. The server includes a processor 130 and a memory 131; the memory 131 stores machine-executable instructions that can be executed by the processor 130, and the processor 130 executes them to implement the above training data generation method for video tasks.
Further, the server shown in fig. 5 further includes a bus 132 and a communication interface 133, and the processor 130, the communication interface 133 and the memory 131 are connected through the bus 132.
The memory 131 may include a high-speed random access memory (RAM) and may also include a non-volatile memory, such as at least one disk memory. The communication connection between this system network element and at least one other network element is realized through at least one communication interface 133 (which may be wired or wireless), and may use the Internet, a wide area network, a local network, a metropolitan area network, or the like. The bus 132 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one double-headed arrow is shown in fig. 5, but this does not indicate only one bus or one type of bus.
The processor 130 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 130. The processor 130 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component. The various methods, steps, and logic blocks disclosed in the embodiments of the present invention may be implemented or performed. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the method disclosed in connection with the embodiments of the present invention may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory 131, and the processor 130 reads the information in the memory 131 and completes the steps of the method of the foregoing embodiments in combination with its hardware.
An embodiment of the present invention further provides a machine-readable storage medium storing machine-executable instructions that, when called and executed by a processor, cause the processor to implement the above method for generating training data of video tasks.
The computer program product of the training data generation method, apparatus, and server for video tasks provided by the embodiments of the present invention includes a computer-readable storage medium storing program code; the instructions included in the program code may be used to execute the method described in the foregoing method embodiments, and specific implementations can be found in those embodiments and are not repeated here.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
Finally, it should be noted that the above embodiments are only specific embodiments of the present invention, used to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that anyone familiar with the technical field may still modify or easily conceive of changes to the technical solutions described in the foregoing embodiments, or substitute equivalents for some of their technical features, within the technical scope of the present disclosure; such modifications, changes, or substitutions do not depart from the spirit and scope of the embodiments of the present invention and shall all be covered within its protection scope. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (10)

1. A method for generating training data of a video task, the method comprising:
performing frame interpolation processing based on a preset image group to obtain a video frame sequence; wherein the image group comprises a plurality of images;
performing degradation processing on the video frame sequence to reduce the definition of the video frame sequence to obtain the degraded video frame sequence;
generating training data of a preset video task through the video frame sequence and the degraded video frame sequence;
the image group comprises an original image and a transformed image obtained by transforming the original image;
wherein the step of performing frame interpolation processing based on a preset image group to obtain a video frame sequence comprises:
performing frame interpolation between the original image and the transformed image in a preset video frame interpolation mode to obtain a video frame sequence;
wherein the step of generating training data of the preset video task through the video frame sequence and the degraded video frame sequence comprises:
determining a target frame position according to a preset order;
respectively acquiring a video frame segment corresponding to the position of the target frame from the video frame sequence and the degraded video frame sequence, and taking the acquired video frame segment as a training data pair;
and determining the obtained training data pair as the training data of the preset video task.
2. The method of claim 1, wherein the step of performing frame interpolation processing based on a preset image group to obtain a video frame sequence comprises:
performing frame interpolation between the original image and the transformed image through a video frame interpolation model obtained by pre-training to obtain a video frame sequence; wherein, in the video frame sequence, the image content of at least a part of the video frames changes continuously.
3. The method of claim 2, wherein the image group comprises a plurality of kinds of transformed images, each kind obtained by transforming the original image through a transformation operation corresponding to that kind;
wherein the step of performing frame interpolation processing between the original image and the transformed image comprises:
for each kind of transformed image, performing frame interpolation between the original image and that kind of transformed image to obtain a video subsequence corresponding to that kind; and
splicing the video subsequences corresponding to each kind of transformed image to obtain a video frame sequence.
4. The method of claim 2, wherein the transformed images in the image group comprise: a first transformed image obtained by transforming the original image through a preset transformation operation, and a second transformed image obtained through an inverse operation corresponding to the transformation operation;
wherein the step of performing frame interpolation processing between the original image and the transformed image comprises: performing frame interpolation processing between the first transformed image and the original image and between the original image and the second transformed image respectively to obtain a video frame sequence.
5. The method of any one of claims 1 to 4, wherein there are a plurality of image groups;
before the step of performing frame interpolation processing based on the preset image group to obtain the video frame sequence, the method further comprises: scaling the images in each image group to a preset size;
and after the step of performing frame interpolation processing based on the preset image group to obtain a video frame sequence, the method further comprises: splicing the video frame sequences corresponding to the respective image groups to obtain a spliced video frame sequence.
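A sketch of this multi-group variant under the same assumptions, where each group is first scaled to a common preset size and the per-group sequences are then spliced; the (original, transformed) group layout is an assumption for illustration:

import cv2

def build_from_groups(image_groups, size=(256, 256)):
    # image_groups: list of (original, transformed) pairs (assumed layout).
    spliced = []
    for original, transformed in image_groups:
        o = cv2.resize(original, size)
        t = cv2.resize(transformed, size)
        spliced.extend(interpolate_frames(o, t))
    return spliced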
6. The method of claim 1, wherein the degradation processing comprises one or more of encoding, blurring, and adding noise.
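As an assumed stand-in for encoding-based degradation, a lossy JPEG round-trip at low quality can play the role of the codec; a production pipeline would more likely re-encode with an actual video codec:

import cv2

def degrade_by_encoding(frame, quality=30):
    # Encode the frame lossily, then decode it again; the round trip
    # discards detail much like aggressive video compression.
    ok, buf = cv2.imencode(".jpg", frame, [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    if not ok:
        raise RuntimeError("JPEG encoding failed")
    return cv2.imdecode(buf, cv2.IMREAD_COLOR)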
7. An apparatus for generating training data for a video-type task, the apparatus comprising:
the frame interpolation module is used for performing frame interpolation processing based on a preset image group to obtain a video frame sequence; wherein the image group comprises a plurality of images;
a degradation module, configured to perform degradation processing on the video frame sequence to reduce the definition of the video frame sequence, so as to obtain a degraded video frame sequence;
the training data generation module is used for generating training data of a preset video task through the video frame sequence and the degraded video frame sequence;
the image group comprises an original image and a transformed image obtained by transforming the original image;
wherein the frame interpolation module is further configured to: perform frame interpolation between the original image and the transformed image in a preset video frame interpolation mode to obtain the video frame sequence;
the training data generation module is further configured to:
determine a target frame position according to a preset order;
acquire, from the video frame sequence and the degraded video frame sequence respectively, video frame segments corresponding to the target frame position, and take the acquired video frame segments as a training data pair;
and determine the obtained training data pair as the training data of the preset video task.
8. The apparatus of claim 7, wherein the frame interpolation module is further configured to:
perform frame interpolation between the original image and the transformed image through a pre-trained video frame interpolation model to obtain the video frame sequence, wherein, in the video frame sequence, the image content of at least some of the video frames changes continuously.
9. A server, comprising a processor and a memory, the memory storing machine-executable instructions executable by the processor, wherein the processor executes the machine-executable instructions to implement the method for generating training data for a video-type task according to any one of claims 1 to 6.
10. A machine-readable storage medium having stored thereon machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method for generating training data for a video-type task according to any one of claims 1 to 6.
CN201911280328.2A 2019-12-12 2019-12-12 Training data generation method and device for video tasks and server Active CN110996171B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911280328.2A CN110996171B (en) 2019-12-12 2019-12-12 Training data generation method and device for video tasks and server

Publications (2)

Publication Number Publication Date
CN110996171A CN110996171A (en) 2020-04-10
CN110996171B (en) 2021-11-26

Family

ID=70093188

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911280328.2A Active CN110996171B (en) 2019-12-12 2019-12-12 Training data generation method and device for video tasks and server

Country Status (1)

Country Link
CN (1) CN110996171B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112995673B (en) * 2019-12-13 2023-04-07 北京金山云网络技术有限公司 Sample image processing method and device, electronic equipment and medium
CN114173137A (en) * 2020-09-10 2022-03-11 北京金山云网络技术有限公司 Video coding method and device and electronic equipment
CN114363700B (en) * 2020-10-12 2024-10-01 阿里巴巴集团控股有限公司 Data processing method, device, storage medium and computer equipment
DE102020214766A1 (en) * 2020-11-25 2022-05-25 Robert Bosch Gesellschaft mit beschränkter Haftung Method for generating a temporal sequence of augmented data frames
CN114202732B (en) * 2022-02-15 2022-05-10 南京甄视智能科技有限公司 Video behavior classification method, model training method, device, medium and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106327428A (en) * 2016-08-31 2017-01-11 深圳大学 Image super-resolution method and system based on transfer learning
CN107133919A (en) * 2017-05-16 2017-09-05 西安电子科技大学 Time dimension video super-resolution method based on deep learning
CN109726806A (en) * 2017-10-30 2019-05-07 上海寒武纪信息科技有限公司 Information processing method and terminal device
CN109740505A (en) * 2018-12-29 2019-05-10 成都视观天下科技有限公司 A kind of training data generation method, device and computer equipment
EP3483746A1 (en) * 2017-11-09 2019-05-15 Snips Methods and devices for generating data to train a natural language understanding component
CN109922231A (en) * 2019-02-01 2019-06-21 重庆爱奇艺智能科技有限公司 A kind of method and apparatus for generating the interleave image of video
WO2019137248A1 (en) * 2018-01-12 2019-07-18 广州华多网络科技有限公司 Video frame interpolation method, storage medium and terminal
CN110278487A (en) * 2018-03-14 2019-09-24 阿里巴巴集团控股有限公司 Image processing method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant