CN113099146B - Video generation method and device and related equipment

Info

Publication number
CN113099146B
Authority
CN
China
Prior art keywords: video, image, resolution, frame, frames
Prior art date
Legal status
Active
Application number
CN201911320245.1A
Other languages
Chinese (zh)
Other versions
CN113099146A (en)
Inventor
朱聪超
罗巍
王强
邓斌
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201911320245.1A
Publication of CN113099146A
Application granted
Publication of CN113099146B

Classifications

    • H04N 5/76: Television signal recording (details of television systems)
    • H04N 23/667: Camera operation mode switching, e.g. between still and video, sport and normal or high- and low-resolution modes
    • H04N 23/951: Computational photography systems, e.g. light-field imaging systems, using two or more images to influence resolution, frame rate or aspect ratio
    • H04N 7/0127: Conversion of standards at pixel level by changing the field or frame frequency of the incoming video signal, e.g. frame rate converter

Abstract

The embodiments of the invention disclose a video generation method, a video generation device and related equipment, which can be applied in particular to cameras, smart phones and the like to improve video quality. The method comprises the following steps: acquiring first video data acquired by a first camera in a first time period, wherein the first video data comprises a plurality of frames of first video frames; acquiring image data acquired by a second camera in the first time period, wherein the image data comprises one or more images; adjusting the resolution of each first video frame in the plurality of first video frames to obtain second video data; and performing image fusion on one or more second video frames in the plurality of second video frames based on the image data to obtain third video data. The method and the device can be applied in many technical fields such as intelligent video processing, and can improve the resolution of a video more intelligently and more accurately.

Description

Video generation method and device and related equipment
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video generation method, an apparatus, and a related device.
Background
With the rapid development of 5G technology and the advent of the Internet of Things (IoT) large-screen era, users have increasingly high requirements on the definition of video pictures: common 720P or 1080P definition is often no longer sufficient, and 4K high-definition or even 8K ultra-high-definition video pictures are pursued instead.
However, in the prior art, it is often difficult to achieve a high frame rate and a high resolution at the same time due to constraints such as the bus transmission capability and data processing capability of the Sensor and the Image Signal Processor (ISP); that is, the generated video may not be both highly fluent and highly clear. Currently, the recording function of a mobile phone can typically support recording at 30fps (a video frame rate of 30 frames per second) with 4K image quality, or at 240fps (a video frame rate of 240 frames per second) with 720P image quality. Obviously, if the recording frame rate is to be increased, the recording resolution has to be decreased; conversely, if the recording resolution is to be increased, the frame rate has to be decreased. Thus, a user's requirements for recording video at both a high frame rate and a high resolution cannot be satisfied at the same time.
Disclosure of Invention
The embodiments of the present invention provide a video generation method, a video generation device and related equipment, which process acquired original video data more intelligently, accurately and efficiently, improve video quality and meet the actual requirements of users.
In a first aspect, an embodiment of the present invention provides a video generation method, which may include: acquiring first video data acquired by a first camera in a first time period, wherein the first video data comprises a plurality of frames of first video frames, and the resolution of each frame of the plurality of frames of first video frames is a first resolution; acquiring image data acquired by a second camera in the first time period, wherein the image data comprises one or more images, and the resolution of each image in the one or more images is a second resolution; adjusting the resolution of each first video frame in the multiple first video frames to obtain second video data, wherein the second video data comprises multiple second video frames, and the resolution of each second video frame in the multiple second video frames is the second resolution; and performing image fusion on one or more second video frames in the plurality of second video frames based on the image data to obtain third video data, wherein the third video data comprises a plurality of third video frames.
In the embodiment of the present invention, first video data (for example, including multiple frames of first video frames) and image data (for example, including one or more images) may be acquired by two cameras respectively, in the same time period and over the same shooting field of view. Then, according to the resolution of the images in the image data, the resolution of the multiple first video frames in the first video data is adjusted to obtain second video data (for example, including multiple frames of second video frames), where the resolution of the second video frames is consistent with the resolution of the images. Finally, based on the images, image fusion is performed on one or more of the second video frames to further improve the video quality, so as to obtain third video data (for example, including multiple frames of third video frames). The embodiment of the invention can be used in various daily video recording scenarios: original video data and image data synchronously acquired by a camera or a mobile phone are processed to obtain video data with a high resolution and a high frame rate, so that the video quality of terminal devices such as cameras or mobile phones is improved and the actual requirements of users are met. Optionally, in the present application, the resolution of the images may be greater than the resolution of the first video frames, or less than or equal to it; that is, according to the acquired first video data and image data, either high-resolution, high-quality video data or low-resolution, lightweight video data that occupies little memory and meets the actual needs of the user can be obtained. In addition, in some possible embodiments, video data and image data may be collected by three or more cameras respectively, and the video data may be processed based on the image data to improve the video quality.
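The flow described above can be summarized in a minimal Python sketch (assuming OpenCV is available). The function and variable names are illustrative only, not taken from the patent, and the fusion step is left as a placeholder that the later implementations refine:

import cv2

def fuse_static_detail(second_frame, reference_image):
    # Placeholder for the static-region detail fusion described below;
    # here the upscaled frame is simply returned unchanged.
    return second_frame

def generate_video(first_frames, frame_times, images, image_times, second_hw):
    """first_frames: low-resolution frames from the first camera (first video data).
    images: stills from the second camera (image data), captured over the same
    time period and field of view. second_hw: (height, width) of the second resolution."""
    h, w = second_hw
    # Adjust every first video frame to the second resolution; a learned
    # super-resolution model could replace the plain bicubic resize.
    second_frames = [cv2.resize(f, (w, h), interpolation=cv2.INTER_CUBIC)
                     for f in first_frames]
    # Fuse each second video frame with the still image whose capture time
    # is closest to it, yielding the third video data.
    third_frames = []
    for frame, t in zip(second_frames, frame_times):
        j = min(range(len(images)), key=lambda k: abs(image_times[k] - t))
        third_frames.append(fuse_static_detail(frame, images[j]))
    return third_frames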
In one possible implementation, the second resolution is greater than the first resolution; the adjusting the resolution of each first video frame of the plurality of first video frames to obtain second video data includes: and performing super-resolution reconstruction on each frame of the plurality of frames of the first video frames based on a first model obtained by pre-training to obtain the second video data.
In the embodiment of the present invention, low-definition video data (for example, including multiple frames of low-definition video frames) is collected by the first camera and high-definition image data (for example, including multiple high-definition images) is collected by the second camera. According to the resolution of the high-definition images (for example, the second resolution), each low-definition video frame is reconstructed by super-resolution based on a pre-trained model, so as to obtain a high-definition video frame whose resolution is consistent with that of the high-definition images (that is, to obtain high-definition video data). The method can improve the definition of video pictures recorded by equipment such as cameras and mobile phones more efficiently and accurately, improving the video quality and meeting user requirements.
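A minimal sketch of how a pre-trained first model might be applied frame by frame is given below (PyTorch; the model interface and the pre- and post-processing are assumptions for illustration, not the patent's actual network):

import torch

@torch.no_grad()
def super_resolve_video(first_frames, model, device="cpu"):
    """first_frames: list of HxWx3 uint8 numpy arrays at the first resolution.
    model: a pre-trained network mapping a low-resolution frame to the
    second resolution (interface assumed for illustration)."""
    model.eval().to(device)
    second_frames = []
    for frame in first_frames:
        # HWC uint8 -> NCHW float in [0, 1]
        x = torch.from_numpy(frame).permute(2, 0, 1).float().div(255.0)
        y = model(x.unsqueeze(0).to(device)).squeeze(0).clamp(0, 1)
        second_frames.append((y.permute(1, 2, 0).cpu().numpy() * 255).astype("uint8"))
    return second_frames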
In one possible implementation, the method further includes: acquiring a training sample, wherein the training sample comprises an original image set and a target image set; the original image set comprises N original images, the resolution of each original image in the N original images is the first resolution, the target image set comprises N target images, the resolution of each target image in the N target images is the second resolution, the N original images correspond to the N target images in a one-to-one mode, and N is an integer greater than or equal to 1; and training to obtain the first model by taking the N original images and the N target images as training input and taking the target images corresponding to the N original images as N labels.
In the embodiment of the invention, a more reasonable model meeting the actual requirements can be obtained through continuous deep learning and training based on input training samples (for example, comprising a large number of low-definition images and the high-definition images corresponding to them one to one). The model obtained through this training can perform resolution processing (for example, super-resolution reconstruction) on each first video frame more intelligently, efficiently and accurately, improving the resolution of each video frame, so that the video picture is clearer and the video quality is improved.
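A minimal training-loop sketch for such a model, assuming the paired low-resolution originals and high-resolution targets have already been loaded as tensors (PyTorch; the loss and optimizer choices are illustrative assumptions, not specified by the patent):

import torch
import torch.nn as nn

def train_first_model(model, originals, targets, epochs=100, lr=1e-4):
    """originals: N x 3 x h x w tensor at the first resolution.
    targets: N x 3 x H x W tensor at the second resolution, paired one to one
    with the originals (the N labels)."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.L1Loss()  # pixel-wise loss, an assumed choice
    model.train()
    for _ in range(epochs):
        optimizer.zero_grad()
        # Full-batch pass for brevity; real training would iterate over mini-batches.
        loss = criterion(model(originals), targets)
        loss.backward()
        optimizer.step()
    return model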
In one possible implementation, the first model is a convolutional neural network model; the super-resolution reconstruction of each first video frame in the plurality of first video frames based on the first model obtained by pre-training to obtain the second video data comprises: and based on the convolutional neural network model, performing super-resolution reconstruction on each first video frame in the plurality of first video frames to obtain second video frames corresponding to the first video frames, wherein the second video data comprises the second video frames corresponding to the first video frames.
In the embodiment of the invention, super-resolution reconstruction can be carried out on each first video frame based on the convolutional neural network model obtained by training. For example, through a series of convolution operations, the second video frame corresponding to each first video frame is obtained, that is, a video frame with a higher resolution, so that the video picture is clearer, the video quality is higher, and the viewing experience of the user is improved. The super-resolution reconstruction method based on the convolutional neural network has simple steps and high efficiency.
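One possible shape of such a convolutional network is an SRCNN-style stack of convolutions applied to a pre-upsampled frame; the sketch below is written under that assumption, since the patent does not fix a specific architecture:

import torch.nn as nn
import torch.nn.functional as F

class SRCNNLike(nn.Module):
    """SRCNN-style network: upsample the frame to the target size, then refine it
    with three convolutions (feature extraction, non-linear mapping, reconstruction)."""
    def __init__(self, scale=2):
        super().__init__()
        self.scale = scale
        self.conv1 = nn.Conv2d(3, 64, kernel_size=9, padding=4)
        self.conv2 = nn.Conv2d(64, 32, kernel_size=5, padding=2)
        self.conv3 = nn.Conv2d(32, 3, kernel_size=5, padding=2)

    def forward(self, x):
        x = F.interpolate(x, scale_factor=self.scale, mode="bicubic", align_corners=False)
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        return self.conv3(x)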
In a possible implementation manner, before the performing image fusion on one or more second video frames of the plurality of second video frames based on the image data to obtain third video data, the method further includes: respectively carrying out dynamic area detection on the ith frame of the second video frames and the jth image of the one or more images, and determining the dynamic area and the static area in the ith frame of the second video frames and the jth image, wherein i is an integer greater than or equal to 1, and j is an integer greater than or equal to 1.
In the embodiment of the present invention, the frame rates at which the first camera and the second camera respectively capture the first video data and the image data are different (for example, the first camera captures first video frames at 60fps, that is, 60 frames per second, while the second camera captures image data at 10fps, that is, 10 images per second). A second video frame therefore does not necessarily correspond to an image captured at exactly the same time (for example, the j-th image may simply be the image whose capture time is closest to that of the i-th second video frame), so there may be regions that are consistent between the two, and also regions that are inconsistent due to hand shake during capture, object motion in the shooting field of view, and the like. By performing dynamic-region detection on the i-th second video frame and the j-th image respectively, the dynamic regions and static regions in the i-th second video frame and the j-th image can be determined, and thus the dynamic and static regions in every second video frame and every image can be determined. For dynamic regions, the information in the second video frame better reflects the actual situation; for static regions, the details in the image are clearer and richer. Therefore, based on the difference between the dynamic and static regions, image fusion can be performed on the second video frame in a better and more targeted way, so that the video picture both reflects the actual situation and has clear, rich details, improving the video quality.
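Dynamic-region detection between the i-th second video frame and the j-th image could, for example, be sketched as a simple frame difference with thresholding and morphological clean-up (an illustrative assumption; the patent does not prescribe a particular detection method):

import cv2
import numpy as np

def detect_dynamic_region(second_frame, image, diff_thresh=25):
    """Returns a binary mask (1 = dynamic, 0 = static), assuming the two inputs
    have the same resolution and are roughly registered."""
    g1 = cv2.cvtColor(second_frame, cv2.COLOR_BGR2GRAY).astype(np.int16)
    g2 = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY).astype(np.int16)
    diff = np.abs(g1 - g2).astype(np.uint8)
    _, mask = cv2.threshold(diff, diff_thresh, 1, cv2.THRESH_BINARY)
    # Morphological opening and closing remove isolated noisy pixels so the
    # mask forms contiguous dynamic and static regions.
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask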
In a possible implementation manner, in the respective acquisition time corresponding to the one or more images, the acquisition time corresponding to the jth image is the acquisition time with the smallest time difference with the acquisition time corresponding to the ith frame of the second video frame.
In the embodiment of the present invention, the frame rates at which the first camera and the second camera respectively acquire the first video data and the image data are different (for example, the first camera acquires first video frames at 60fps, that is, 60 frames per second, while the second camera captures 10 images per second at 10fps), so a given second video frame does not necessarily correspond to an image captured at exactly the same time. There may therefore be regions that coincide between the two, and regions that do not coincide due to hand shake during shooting, movement of objects in the shooting field of view, and the like. For this reason, when performing image fusion on the i-th second video frame, the fusion should be performed with reference to the image that coincides with it over the largest area (for example, the j-th image, whose capture time is closest to that of the i-th second video frame). This improves the accuracy and rationality of the image fusion and thus the video quality.
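For the constant frame rates used in the example above, the index of the nearest image can be computed directly (a trivial sketch; the assumption is that both streams start at the same instant):

def closest_image_index(i, video_fps=60.0, image_fps=10.0, num_images=None):
    """For the i-th second video frame (0-based), return the index j of the image
    whose capture-time difference is smallest, assuming constant frame rates
    (60 fps video and 10 fps images in the example) and a common start time."""
    t = i / video_fps
    j = int(round(t * image_fps))
    if num_images is not None:
        j = min(j, num_images - 1)
    return j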
In a possible implementation manner, the performing, based on the image data, image fusion on one or more second video frames of the plurality of second video frames to obtain third video data includes: and if the image high-frequency information of the static area in the jth image is greater than the image high-frequency information of the static area in the ith second video frame, replacing the image high-frequency information of the static area in the ith second video frame with the image high-frequency information of the static area in the jth image to obtain the third video data, wherein the information of the dynamic area in the third video frame corresponding to the ith second video frame is the information of the dynamic area in the ith second video frame.
In the embodiment of the present invention, since the i-th second video frame and the j-th image are not necessarily captured at the same time, there may be regions that are consistent between the two, and also regions that are inconsistent due to hand shake during shooting or object movement in the shooting field of view. For a dynamic region, the information in the i-th second video frame better reflects the actual situation; for a static region, the details in the j-th image are clearer and richer. Thus, the information of the dynamic region in the i-th video frame can be retained. If the image high-frequency information (for example, the edges and details of the image) of the static region in the j-th image is richer than that of the static region in the i-th second video frame, the high-frequency information of the static region in the i-th second video frame can be replaced with that of the static region in the j-th image. In this way the video picture both reflects the actual situation and has clearer, richer details, further improving the video quality.
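The static-region fusion can be sketched by splitting each picture into a low-frequency and a high-frequency component (here with a Gaussian blur, an assumed decomposition) and substituting the image's high-frequency detail into the static region of the video frame wherever it is stronger; this fills in the placeholder used in the overview sketch above:

import cv2
import numpy as np

def fuse_static_detail(second_frame, image, dynamic_mask):
    """second_frame and image share the second resolution; dynamic_mask is 1 in
    dynamic regions and 0 in static regions (see the detection sketch above)."""
    f = second_frame.astype(np.float32)
    g = image.astype(np.float32)
    # High-frequency component = picture minus its blurred (low-frequency) version.
    hf_frame = f - cv2.GaussianBlur(f, (0, 0), sigmaX=3)
    hf_image = g - cv2.GaussianBlur(g, (0, 0), sigmaX=3)
    # Static pixels where the image carries more high-frequency energy than the frame.
    stronger = np.abs(hf_image).sum(axis=2) > np.abs(hf_frame).sum(axis=2)
    replace = (stronger & (dynamic_mask == 0))[..., None]
    # Keep the frame's low-frequency content and its own dynamic regions; replace
    # only the static-region high-frequency detail with the image's detail.
    fused = np.where(replace, f - hf_frame + hf_image, f)
    return np.clip(fused, 0, 255).astype(np.uint8)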
In one possible implementation, the second resolution is less than or equal to the first resolution; the method further comprises the following steps: acquiring a training sample, wherein the training sample comprises an original image set and a target image set; the original image set comprises N original images, the resolution of each original image in the N original images is the first resolution, the target image set comprises N target images, the resolution of each target image in the N target images is the second resolution, the N original images and the N target images are in one-to-one correspondence, and N is an integer greater than or equal to 1; and training to obtain a second model by taking the N original images and the N target images as training input and taking the target images corresponding to the N original images as N labels.
In the embodiment of the present invention, the resolution of the images acquired by the second camera may be less than or equal to the resolution of the first video frames acquired by the first camera. A more reasonable model meeting the actual requirements can be obtained through continuous deep learning and training based on input training samples (for example, a large number of high-definition images and the low-definition images corresponding to them one to one). The model obtained through this training can perform resolution processing (such as resolution compression) on each first video frame more intelligently, efficiently and accurately, reducing the resolution of each video frame, so that the video data is small in size, occupies little memory and is convenient to share.
In one possible implementation, the second resolution is less than or equal to the first resolution; the adjusting the resolution of each first video frame of the multiple first video frames to obtain the second video data includes: and compressing each frame of first video frames in the multiple frames of first video frames based on the second model obtained by pre-training to obtain the second video data.
In the embodiment of the present invention, the resolution of the images acquired by the second camera may be less than or equal to the resolution of the first video frames acquired by the first camera. Therefore, according to the resolution of the low-definition images (for example, the second resolution), each high-definition video frame is compressed based on the pre-trained model to obtain a low-definition video frame whose resolution is consistent with that of the low-definition images (that is, to obtain low-definition video data). The method can more efficiently and accurately obtain low-resolution, lightweight video data that occupies little memory and meets the actual requirements of users.
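In this case the per-frame adjustment reduces to downscaling each first video frame to the second resolution; a plain resizing sketch is shown below (a trained second model could replace the simple resize to preserve detail better while shrinking the data):

import cv2

def compress_video_frames(first_frames, second_hw):
    """Downscale each first video frame to the (lower) second resolution.
    INTER_AREA is a reasonable default interpolation for shrinking images."""
    h, w = second_hw
    return [cv2.resize(f, (w, h), interpolation=cv2.INTER_AREA)
            for f in first_frames]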
In one possible implementation, an aspect ratio of each of the plurality of first video frames is consistent with an aspect ratio of each of the one or more images.
In the embodiment of the invention, the first video data and the image data are acquired by two cameras respectively. In order to make the image data a more valuable reference, the information contained in the image data should be as close as possible to the information contained in the first video data. Therefore, the aspect ratio of each video frame should be consistent with the aspect ratio of each image, so that resolution adjustment, image fusion and other operations can be performed on the first video data more reasonably, accurately and efficiently, yielding high-quality video data (e.g. high-resolution video data) that meets the user's requirements.
In a second aspect, an embodiment of the present invention provides a terminal device, including a processor, a memory, a display screen, a first camera, and a second camera; the memory, the display screen, the first camera, and the second camera are coupled to the processor, the memory configured to store computer program code, the computer program code including computer instructions, the processor invoking the computer instructions to cause the terminal device to perform:
displaying a shooting interface on the display screen, wherein the shooting interface comprises a shooting control;
in response to a shooting instruction detected by the shooting control, calling the first camera and the second camera to respectively collect first video data and image data in a first time period, wherein the first video data comprises multiple frames of first video frames, the resolution of each frame of the multiple frames of first video frames is a first resolution, the image data comprises one or more images, and the resolution of each image of the one or more images is a second resolution;
adjusting the resolution of each first video frame in the multiple first video frames to obtain second video data, wherein the second video data comprises multiple second video frames, and the resolution of each second video frame in the multiple second video frames is the second resolution;
based on the image data, carrying out image fusion on one or more frames of second video frames in the multiple frames of second video frames to obtain third video data, wherein the third video data comprises multiple frames of third video frames;
displaying the third video data on the display screen.
In a third aspect, an embodiment of the present invention provides a video generating apparatus, which may include:
the first acquisition unit is used for acquiring first video data acquired by a first camera in a first time period, wherein the first video data comprises a plurality of frames of first video frames, and the resolution of each frame of the plurality of frames of first video frames is a first resolution;
the second acquisition unit is used for acquiring image data acquired by a second camera in the first time period, wherein the image data comprises one or more images, and the resolution of each image in the one or more images is a second resolution;
the adjusting unit is used for adjusting the resolution of each first video frame in the multiple first video frames to obtain second video data, wherein the second video data comprises multiple second video frames, and the resolution of each second video frame in the multiple second video frames is the second resolution;
and the image fusion unit is used for carrying out image fusion on one or more second video frames in the multiple second video frames based on the image data to obtain third video data, wherein the third video data comprises multiple third video frames.
In one possible implementation, the second resolution is greater than the first resolution; the adjusting unit is specifically configured to:
and performing super-resolution reconstruction on each frame of the plurality of frames of the first video frames based on a first model obtained by pre-training to obtain the second video data.
In one possible implementation, the apparatus further includes:
a third obtaining unit, configured to obtain a training sample, where the training sample includes an original image set and a target image set; the original image set comprises N original images, the resolution of each original image in the N original images is the first resolution, the target image set comprises N target images, the resolution of each target image in the N target images is the second resolution, the N original images correspond to the N target images in a one-to-one mode, and N is an integer greater than or equal to 1;
and the training unit is used for training to obtain the first model by taking the N original images and the N target images as training input and taking the target images corresponding to the N original images as N labels.
In one possible implementation, the first model is a convolutional neural network model; the adjusting unit is further specifically configured to:
and based on the convolutional neural network model, performing super-resolution reconstruction on each first video frame in the plurality of first video frames to obtain second video frames corresponding to the first video frames, wherein the second video data comprises the second video frames corresponding to the first video frames.
In one possible implementation, the apparatus further includes:
a determining unit, configured to perform dynamic region detection on an ith frame of the multiple frames of second video frames and a jth image of the one or more images, respectively, and determine a dynamic region and a static region in the ith frame of second video frames and the jth image, where i is an integer greater than or equal to 1, and j is an integer greater than or equal to 1.
In a possible implementation manner, in the acquisition time corresponding to each of the one or more images, the acquisition time corresponding to the jth image is the acquisition time with the smallest time difference with the acquisition time corresponding to the ith frame of the second video frame.
In a possible implementation manner, the image fusion unit is specifically configured to:
and if the image high-frequency information of the static area in the jth image is greater than the image high-frequency information of the static area in the ith second video frame, replacing the image high-frequency information of the static area in the ith second video frame with the image high-frequency information of the static area in the jth image to obtain the third video data, wherein the information of the dynamic area in the third video frame corresponding to the ith second video frame is the information of the dynamic area in the ith second video frame.
In one possible implementation, the second resolution is less than or equal to the first resolution; the adjusting unit is specifically configured to:
and performing resolution compression on each first video frame in the multiple first video frames based on a second model obtained by pre-training to obtain the second video data.
In one possible implementation, an aspect ratio of each of the plurality of frames of the first video frame is consistent with an aspect ratio of each of the one or more images.
In a fourth aspect, an embodiment of the present invention provides a terminal device, which may include: a processor, a first camera and a second camera coupled to the processor:
the first camera is used for collecting first video data in a first time period;
the second camera is used for collecting image data in the first time period;
the processor is configured to:
acquiring the first video data, wherein the first video data comprises a plurality of frames of first video frames, and the resolution of each frame of the plurality of frames of first video frames is a first resolution;
acquiring the image data, wherein the image data comprises one or more images, and the resolution of each image in the one or more images is a second resolution;
adjusting the resolution of each first video frame in the multiple first video frames to obtain second video data, wherein the second video data comprises multiple second video frames, and the resolution of each second video frame in the multiple second video frames is the second resolution;
and performing image fusion on one or more second video frames in the plurality of second video frames based on the image data to obtain third video data, wherein the third video data comprises a plurality of third video frames.
In one possible implementation, the second resolution is greater than the first resolution; the processor is specifically configured to:
and performing super-resolution reconstruction on each first video frame in the plurality of first video frames based on a first model obtained by pre-training to obtain the second video data.
In one possible implementation, the second resolution is greater than the first resolution; the processor is further configured to:
acquiring a training sample, wherein the training sample comprises an original image set and a target image set; the original image set comprises N original images, the resolution of each original image in the N original images is the first resolution, the target image set comprises N target images, the resolution of each target image in the N target images is the second resolution, the N original images and the N target images are in one-to-one correspondence, and N is an integer greater than or equal to 1;
and training to obtain the first model by taking the N original images and the N target images as training input and taking the target images corresponding to the N original images as N labels.
In one possible implementation, the first model is a convolutional neural network model; the processor is specifically configured to:
and based on the convolutional neural network model, performing super-resolution reconstruction on each first video frame in the plurality of first video frames to obtain second video frames corresponding to the first video frames, wherein the second video data comprises the second video frames corresponding to the first video frames.
In a possible implementation manner, before the performing, based on the image data, image fusion on one or more second video frames of the multiple second video frames to obtain third video data, the processor is further configured to:
respectively carrying out dynamic area detection on the ith frame of the second video frames and the jth image of the one or more images, and determining the dynamic area and the static area in the ith frame of the second video frames and the jth image, wherein i is an integer greater than or equal to 1, and j is an integer greater than or equal to 1.
In a possible implementation manner, in the acquisition time corresponding to each of the one or more images, the acquisition time corresponding to the jth image is the acquisition time with the smallest time difference with the acquisition time corresponding to the ith frame of the second video frame.
In one possible implementation, the processor is specifically configured to:
and if the image high-frequency information of the static area in the jth image is greater than the image high-frequency information of the static area in the ith second video frame, replacing the image high-frequency information of the static area in the ith second video frame with the image high-frequency information of the static area in the jth image to obtain the third video data, wherein the information of the dynamic area in the third video frame corresponding to the ith second video frame is the information of the dynamic area in the ith second video frame.
In one possible implementation, the second resolution is less than or equal to the first resolution; the processor is specifically configured to:
and performing resolution compression on each frame of the multiple frames of first video frames based on a second model obtained by pre-training to obtain the second video data.
In one possible implementation, an aspect ratio of each of the plurality of first video frames is consistent with an aspect ratio of each of the one or more images.
In a possible implementation manner, the terminal device may further include a display;
the display is used for displaying the third video data.
In a fifth aspect, the present application provides a video generating apparatus, where the video generating apparatus includes a processor, and the processor is configured to support corresponding functions in any one of the video generating methods provided in the first aspect. The video generating device may also include a memory, coupled to the processor, that stores program instructions and data necessary for the video generating device. The video generating apparatus may further include a communication interface for the video generating apparatus to communicate with other devices or a communication network.
In a sixth aspect, the present application provides a terminal device, where the terminal device includes a processor, and the processor is configured to support the terminal device to execute a corresponding function in any one of the video generation methods provided in the first aspect. The terminal device may also include a memory, coupled to the processor, that stores program instructions and data necessary for the terminal device. The terminal device may also include a communication interface for the terminal device to communicate with other devices or a communication network.
In a seventh aspect, the present application provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the flow of the video generation method in any one of the above first aspects.
In an eighth aspect, an embodiment of the present invention provides a computer program product, where the computer program product includes instructions, and when the computer program is executed by a computer, the computer may execute the video generation method flow described in any one of the above first aspects.
In a ninth aspect, the present application provides a chip system, where the chip system includes a processor, and is configured to implement the functions related to the video generation method flow in any one of the above first aspects. In one possible design, the system-on-chip further includes a memory for storing program instructions and data necessary for the video generation method. The chip system may be constituted by a chip, or may include a chip and other discrete devices.
Drawings
Fig. 1 is a schematic defect diagram of a video frame interpolation algorithm according to an embodiment of the present invention;
fig. 2 is a functional block diagram of a terminal device according to an embodiment of the present invention;
fig. 3 is a block diagram of a software structure of a terminal device according to an embodiment of the present invention;
FIGS. 4 a-4 c are schematic diagrams of a set of interfaces involved in the prior art provided by embodiments of the present invention;
fig. 5 is a schematic view of an application scenario of a video generation method according to an embodiment of the present invention;
FIGS. 6 a-6 d are schematic diagrams of a set of interfaces provided by an embodiment of the present invention;
fig. 7 is a schematic application scenario diagram of another video generation method according to an embodiment of the present invention;
fig. 8 is a schematic flowchart of a video generation method according to an embodiment of the present invention;
fig. 9 is a schematic diagram of a dual camera provided by an embodiment of the present invention;
FIGS. 10 a-10 c are detailed views of image high frequency fusion provided by an embodiment of the present invention;
FIGS. 11 a-11 b are schematic diagrams of the overall steps of a video generation method provided by an embodiment of the present invention;
fig. 12 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present invention;
fig. 13 is a schematic structural diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
The embodiments of the present invention will be described below with reference to the drawings.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus. The size of the pictures included in the drawings of the present application is for convenience of explanation, and does not represent the actual size and the size relationship between the pictures.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.
As used in this specification, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a processor and the processor can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between 2 or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from two components interacting with one another at a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
First, some terms in the present application are explained to facilitate understanding by those skilled in the art.
(1) Super-Resolution (SR) is the process of improving the resolution of an original image by hardware or software methods; reconstructing a corresponding high-resolution image from a single low-resolution image or a series of low-resolution images is super-resolution reconstruction. Deep-learning-based SR is mainly Single Image Super-Resolution (SISR), i.e. super-resolution from a single low-resolution image. Super-resolution technology has important application value in fields such as surveillance equipment, satellite imagery and medical imaging.
(2) A Convolutional Neural Network (CNN) is a feedforward neural network whose artificial neurons respond to part of the surrounding cells within their coverage (receptive field), and it performs well for large-scale image processing. It includes convolutional layers and pooling layers. A CNN is mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling and other forms of distortion, and part of this invariance is mainly provided by the pooling layers. Since the feature detection layers of a CNN learn from training data, explicit feature extraction is avoided when a CNN is used; learning from the training data is implicit. Moreover, because neurons on the same feature map share the same weights, the network can learn in parallel, which is a great advantage of convolutional networks over networks in which neurons are fully connected to each other. With its special structure of locally shared weights, the convolutional neural network has unique advantages in speech recognition and image processing; its layout is closer to that of an actual biological neural network, weight sharing reduces the complexity of the network, and in particular the fact that an image with a multi-dimensional input vector can be fed directly into the network avoids the complexity of data reconstruction during feature extraction and classification. The Super-Resolution Convolutional Neural Network (SRCNN) is also an important application of deep learning in super-resolution reconstruction.
First, in order to facilitate understanding of the embodiments of the present invention, the technical problems to be specifically solved by the present application are further analyzed and presented. In the prior art, there are various schemes for improving the video frame rate and resolution; the following two commonly used schemes are taken as examples.
the first scheme is as follows: by reducing the resolution of the Sensor (Sensor) output, the frame rate of the video is increased. In the prior art, due to the constraints of bus transmission capability, data processing capability and the like of a sensor and an Image Signal Processor (ISP), a video recorded in the prior art is often difficult to satisfy both high frame rate and high resolution. According to the first scheme, the resolution ratio and the frame rate can be coordinated according to actual conditions and requirements, and the frame rate can be effectively improved while the resolution ratio is guaranteed to a certain degree.
The first scheme has the following disadvantage: it is a compromise within the prior art, limited by the bus transmission capability and data processing capability of the sensor and the image signal processor. When the video frame rate is increased, the resolution of the video must be sacrificed, which results in low video definition and a loss of detail in the video.
The second scheme is as follows: the frame rate of the video is improved through a video frame interpolation algorithm. A video frame interpolation algorithm uses the relationship between a previous frame and a following frame to estimate the motion trajectory of the target object and the like, and generates additional intermediate frames so as to increase the frame rate of the video, for example for super-slow-motion recording at 960fps; the frame rate of the video can be greatly increased and the video becomes smoother.
The second scheme has the following defect: it is generally applicable only at smaller resolutions, and inaccurate estimation of the motion trajectory of a target object in the video often causes defects in the video picture (for example, as shown in fig. 1, a picture defect in which rotating fan blades appear broken due to video frame interpolation), which seriously affects the video quality and degrades the user's viewing experience.
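As an illustration of why such interpolation can fail, the following minimal midframe sketch (OpenCV) estimates dense optical flow and warps the two neighbouring frames halfway toward each other; it is a rough approximation that ignores occlusions and flow errors, which is exactly where artifacts such as the broken fan blades in fig. 1 come from:

import cv2
import numpy as np

def interpolate_midframe(prev_bgr, next_bgr):
    """Synthesize one intermediate frame between two consecutive frames by
    estimating dense motion and blending two half-step warps."""
    prev_gray = cv2.cvtColor(prev_bgr, cv2.COLOR_BGR2GRAY)
    next_gray = cv2.cvtColor(next_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    h, w = prev_gray.shape
    gx, gy = np.meshgrid(np.arange(w, dtype=np.float32),
                         np.arange(h, dtype=np.float32))
    # Rough approximation: sample the previous frame half a motion step back and
    # the next frame half a motion step forward, then average the two warps.
    warp_prev = cv2.remap(prev_bgr, gx - 0.5 * flow[..., 0],
                          gy - 0.5 * flow[..., 1], cv2.INTER_LINEAR)
    warp_next = cv2.remap(next_bgr, gx + 0.5 * flow[..., 0],
                          gy + 0.5 * flow[..., 1], cv2.INTER_LINEAR)
    return cv2.addWeighted(warp_prev, 0.5, warp_next, 0.5, 0)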
In summary, neither of the above two schemes can simultaneously improve the resolution and the frame rate of video recorded with existing terminal devices such as ordinary video cameras or mobile phones. Therefore, in order to solve the problem that current video processing technologies do not meet actual service requirements, the technical problem to be actually solved by the present application includes the following aspects: based on original video data acquired by existing terminal equipment such as a camera or a mobile phone, efficiently and accurately improving the resolution of the original video data, and finally obtaining video data with a high frame rate and a high resolution.
Referring to fig. 2, fig. 2 is a functional block diagram of a terminal device according to an embodiment of the present invention. Alternatively, in one embodiment, the terminal device 100 may be configured in a fully or partially automatic shooting mode. For example, the terminal device 100 may be in a timed continuous automatic shooting mode, or an automatic shooting mode in which shooting is performed when a target object (e.g., a human face) set in advance is detected within a shooting range according to a computer instruction, or the like. When the terminal device 100 is in the automatic shooting mode, the terminal device 100 may be set to operate without interaction with a person.
The following specifically describes the embodiment by taking the terminal device 100 as an example. It should be understood that terminal device 100 may have more or fewer components than shown, may combine two or more components, or may have a different configuration of components. The various components shown in the figures may be implemented in hardware, software, or a combination of hardware and software, including one or more signal processing and/or application specific integrated circuits.
The terminal device 100 may include: the mobile terminal includes a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present invention does not specifically limit the terminal device 100. In other embodiments of the present application, terminal device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processing Unit (GPU), an Image Signal Processor (ISP), a controller, a memory, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processing Unit (NPU), etc. Wherein, the different processing units may be independent devices or may be integrated in one or more processors.
The controller may be a neural center and a command center of the terminal device 100, among others. The controller can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to use the instruction or data again, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc.
It should be understood that the interface connection relationship between the modules according to the embodiment of the present invention is only an exemplary illustration, and does not limit the structure of the terminal device 100. In other embodiments of the present application, the terminal device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The charging management module 140 is configured to receive a charging input from a charger. The charger can be a wireless charger or a wired charger.
The power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110. The power management module 141 receives input from the battery 142 and/or the charging management module 140, and provides power to the processor 110, the internal memory 121, the external memory, the display 194, the camera 193, the wireless communication module 160, and the like.
The wireless communication function of the terminal device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The terminal device 100 implements a display function by the GPU, the display screen 194, and the application processor. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the terminal device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The terminal device 100 may implement a shooting function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and the like. In some embodiments, the terminal device 100 may include one or more cameras 193.
The ISP is used to process the data fed back by the camera 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and contrast of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in camera 193.
The camera 193 is used to capture still images or video. An object generates an optical image through the lens, which is projected onto the photosensitive element. The photosensitive element may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and then passes the electrical signal to the ISP, where it is converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard format such as RGB or YUV. In the embodiment of the present invention, the first video data and the image data may be respectively collected by 2 cameras 193; in some embodiments, one or more sets of first video data and image data may also be respectively collected by more than 2 cameras 193. A camera 193 may be located on the front side of the terminal device, for example above the touch screen, or in another position, for example on the back side of the terminal device. In addition, the cameras 193 may further include a camera for capturing images required for face recognition, such as an infrared camera or another camera. The camera for collecting images required for face recognition is generally located on the front side of the terminal device, for example above the touch screen, but may also be located elsewhere, for example on the back side of the terminal device. In some embodiments, terminal device 100 may include other cameras. The terminal device may further comprise a dot matrix emitter (not shown) for emitting light.
The digital signal processor is used for processing digital signals, and can process digital image signals and other digital signals. For example, when the terminal device 100 selects a frequency point, the digital signal processor is used to perform fourier transform or the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The terminal device 100 may support one or more video codecs. In this way, the terminal device 100 can play or record video in a plurality of encoding formats, such as: moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and the like.
The NPU is a neural-network (NN) computing processor, which processes input information quickly by referring to a biological neural network structure, for example, by referring to a transfer mode between neurons of a human brain, and can also learn by itself continuously. The NPU can implement applications such as intelligent recognition of the terminal device 100, for example: image recognition, face recognition, speech recognition, text understanding, and the like.
The external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to extend the storage capability of the terminal device 100. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music, video, etc. are saved in an external memory card.
The internal memory 121 may be used to store computer-executable program code, which includes instructions. The processor 110 executes various functional applications of the terminal device 100 and data processing by executing instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The storage program area may store an operating system, applications (such as a face recognition function, a video recording function, a photographing function, an image processing function, and the like) required by at least one function, and the like. The storage data area may store data created during use of the terminal device 100, and the like. In addition, the internal memory 121 may include a high-speed random access memory, and may further include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (UFS), and the like.
The terminal device 100 may implement an audio function through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal.
The speaker 170A, also called a "horn", is used to convert the audio electrical signal into an acoustic signal.
The receiver 170B, also called "earpiece", is used to convert the electrical audio signal into an acoustic signal.
The microphone 170C, also referred to as a "microphone," is used to convert sound signals into electrical signals.
The earphone interface 170D is used to connect a wired earphone. The earphone interface 170D may be the USB interface 130, or may be a 3.5mm open mobile terminal platform (OMTP) standard interface or a cellular telecommunications industry association of the USA (CTIA) standard interface.
The pressure sensor 180A is used for sensing a pressure signal, and can convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. The pressure sensor 180A can be of a wide variety, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like.
The gyro sensor 180B may be used to determine the motion attitude of the terminal device 100. In some embodiments, the angular velocity of terminal device 100 about three axes (i.e., x, y, and z axes) may be determined by gyroscope sensor 180B.
The proximity light sensor 180G may include, for example, a Light Emitting Diode (LED) and a light detector, such as a photodiode. The light emitting diode may be an infrared light emitting diode.
The ambient light sensor 180L is used to sense the ambient light level. The terminal device 100 may adaptively adjust the brightness of the display screen 194 according to the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust the white balance when taking a picture.
The fingerprint sensor 180H is used to collect a fingerprint. The terminal device 100 may utilize the collected fingerprint characteristics to implement fingerprint unlocking, application lock access, fingerprint-based photographing, fingerprint-based call answering, and the like. The fingerprint sensor 180H may be disposed below the touch screen; the terminal device 100 may receive a touch operation of a user on the touch screen in an area corresponding to the fingerprint sensor, and may collect fingerprint information of the user's finger in response to the touch operation, so as to implement a related function.
The temperature sensor 180J is used to detect temperature. In some embodiments, the terminal device 100 executes a temperature processing policy using the temperature detected by the temperature sensor 180J.
The touch sensor 180K is also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is used to detect a touch operation applied thereto or nearby. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output associated with the touch operation may be provided through the display screen 194. In other embodiments, the touch sensor 180K may be disposed on the surface of the terminal device 100, different from the position of the display screen 194.
The keys 190 include a power-on key, a volume key, and the like. The keys 190 may be mechanical keys. Or may be touch keys. The terminal device 100 may receive a key input, and generate a key signal input related to user setting and function control of the terminal device 100.
Indicator 192 may be an indicator light that may be used to indicate a state of charge, a change in charge, or a message, missed call, notification, etc.
The SIM card interface 195 is used to connect a SIM card. The SIM card can be brought into and out of contact with the terminal device 100 by being inserted into the SIM card interface 195 or being pulled out of the SIM card interface 195. In some embodiments, the terminal device 100 employs eSIM, namely: an embedded SIM card. The eSIM card may be embedded in the terminal device 100 and cannot be separated from the terminal device 100.
The terminal device 100 may be a camera, a smart phone, a smart wearable device, a tablet computer, a laptop computer, and the like having the above functions, which is not particularly limited in the embodiment of the present invention.
Referring to fig. 3, fig. 3 is a block diagram of a software structure of a terminal device according to an embodiment of the present invention.
The layered architecture divides the software into several layers, each layer having a clear role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the Android system is divided into four layers, an application layer, an application framework layer, an Android runtime (Android runtime) and system library, and a kernel layer from top to bottom.
The application layer may include a series of application packages.
As shown in fig. 3, the application package may include applications (also referred to as apps) such as camera, gallery, calendar, phone call, map, navigation, WLAN, bluetooth, music, video, and short message. The application package may further include a video processing application related to the present application; the video processing application may process the original video data by using a video generation method in the present application, thereby obtaining high-frame-rate, high-resolution video data.
The application framework layer provides an Application Programming Interface (API) and a programming framework for the application program of the application layer. The application framework layer includes a number of predefined functions.
As shown in FIG. 3, the application framework layers may include a window manager, content provider, view system, phone manager, resource manager, notification manager, and the like.
The window manager is used for managing window programs. The window manager can obtain the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like.
Content providers are used to store and retrieve data and make it accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
The view system includes visual controls, such as controls to display text and controls to display pictures. The view system may be used to build applications. A display interface may be composed of one or more views. For example, a display interface including a short message notification icon may include a view for displaying text and a view for displaying pictures. For example, the video recording interface may include a related high-definition video recording control; by clicking the high-definition video recording control, the video generation method in the present application may be triggered to synchronously acquire original video data and image data, and to complete a series of processing on the original video data, thereby obtaining high-frame-rate, high-resolution video data.
The phone manager is used to provide a communication function of the terminal device 100. Such as management of call status (including on, off, etc.).
The resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and the like.
The notification manager enables the application to display notification information in the status bar, can be used to convey notification-type messages, can disappear automatically after a short dwell, and does not require user interaction. Such as a notification manager used to inform download completion, message alerts, etc. The notification manager may also be a notification that appears in the form of a chart or scroll bar text at the top status bar of the system, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog interface. For example, text information is prompted in the status bar, a prompt tone is given, the terminal device vibrates, an indicator light flickers, and the like. For example, when performing the high definition video recording according to the present application, the user may be prompted by text information on a video recording end interface to complete the video recording, and a video file obtained by processing the original video data by using one of the video generation methods in the present application and a video file obtained by not processing the original video data may be generated. When the high-definition video recording is performed but the memory of the terminal device 100 is insufficient, the user is prompted through the corresponding text information that the memory is insufficient, the video recording cannot be performed, and the like.
The Android Runtime comprises a core library and a virtual machine. The Android runtime is responsible for scheduling and managing an Android system.
The core library comprises two parts: one part consists of functions that need to be called by the Java language, and the other part is the core library of Android.
The application layer and the application framework layer run in a virtual machine. The virtual machine executes java files of the application layer and the application framework layer as binary files. The virtual machine is used for performing the functions of object life cycle management, stack management, thread management, safety and exception management, garbage collection and the like.
The system library may include a plurality of functional modules. For example: surface managers (surface managers), media Libraries (Media Libraries), three-dimensional graphics processing Libraries (e.g., openGL ES), 2D graphics engines (e.g., SGL), and the like.
The surface manager is used to manage the display subsystem and provide a fusion of the 2D and 3D layers for multiple applications.
The media library supports a variety of commonly used audio, video format playback and recording, and still image files, among others. The media library may support a variety of audio-video encoding formats, such as MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, and the like. The video formats referred to in this application may be, for example, RM, RMVB, MOV, MTV, AVI, AMV, DMV, FLV, etc.
The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, image rendering, synthesis, layer processing and the like.
The 2D graphics engine is a drawing engine for 2D drawing.
The kernel layer is a layer between hardware and software. The kernel layer at least comprises a display driver, a camera driver, an audio driver, and a sensor driver. For example, in the present application, after the corresponding video recording instruction is received, the camera driver may drive the first camera and the second camera to collect the first video data and the image data, respectively.
In the prior art, when a user wants to record a video, the operation process of the terminal device may be as shown in fig. 4a, fig. 4b, and fig. 4c. As shown in fig. 4a, the terminal device displays a video recording interface 401, where the video recording interface 401 may include a shooting control 402, a front and back shooting control 403, a video library control 404, a shooting mode control group 405 (including, for example, a large aperture control 405A, a night scene control 405B, a video recording control 405C, a shooting control 405D, a portrait control 405E, and a more control 405F), a setting control 406, and other controls (such as a flash control and a zoom control). When the user wants to record, the user may click the video recording control 405C to place the terminal device in the video recording mode, and then start recording through the input operation 407 (e.g., clicking). Optionally, before recording, the user may set the resolution, the frame rate, and the like of the recorded video to obtain a recording meeting the requirement. As shown in fig. 4b, the terminal device receives an input operation (e.g., a click) from the user with respect to the setting control 406, and in response to the input operation, the terminal device displays a setting interface 408, where the setting interface 408 may include a plurality of video resolution controls (for example, a (16:9) 1080P control and a 4K control) and a plurality of video frame rate controls. The user can set the required resolution and frame rate by clicking the corresponding video resolution control and video frame rate control. As shown in fig. 4b, when the 4K resolution is selected, the terminal device may provide only the automatic frame rate (for example, the frame rate may be automatically adjusted to be less than or equal to 30fps at the 4K resolution) and 30fps, due to the higher resolution. As shown in fig. 4c, when the 1080P resolution is selected, the frame rates that the terminal device may provide include the automatic frame rate (e.g., a frame rate that may be automatically adjusted to be less than or equal to 60fps at the 1080P resolution), 30fps, and 60fps. The user can set the frame rate of the video to the higher 60fps by clicking the 60fps control 411, so that the video picture is smoother.
Obviously, as can be seen from the prior art shown in fig. 4a, fig. 4b, and fig. 4c, if a user wants to set a relevant resolution and frame rate by himself and perform video recording, so as to obtain a video recording meeting his own requirements, the resolution and frame rate are limited by the terminal device itself. For example, if a user wants to obtain a high-resolution video, only a low frame rate can be selected, and if the user wants to obtain a high-frame rate video, only a low resolution can be selected, which cannot satisfy the requirements of most users for high-resolution and high-frame rate videos. According to the video generation method provided by the embodiment of the invention, when recording, the terminal device can simultaneously and respectively acquire the original video data (such as high frame rate and low resolution) and the image data (such as high resolution and low frame rate) through the two cameras, and process the original video data by referring to the image data to obtain the video data with high frame rate and high resolution, so that the video picture is smooth and clear, and the user requirements are met.
To facilitate understanding of the embodiments of the present invention, the following exemplary list application scenarios to which a video generation method in the present application is applicable, and may include the following 2 scenarios.
In a first scene, terminal equipment respectively and simultaneously acquires original video data and image data through two cameras, and finally generates video data with a high frame rate and a high resolution:
referring to fig. 5, fig. 5 is a schematic view of an application scenario of a video generation method according to an embodiment of the present invention, where the application scenario includes a terminal device (a smart phone is taken as an example in fig. 5). And the terminal device can comprise a relevant shooting module, a display, a processor and the like. The shooting module, the display and the processor can perform data transmission through a system bus. The shooting module may include 2 cameras, such as camera1 and camera2 in fig. 5, and camera1 and camera2 may be controlled by an ISP. The camera1 and the camera2 can convert captured light source signals into digital signals, respectively complete the acquisition of original video data and image data, and then transmit the acquired original video data and image data to the processor through the system bus. The processor processes video frames included in the original video data by using a video generation method in the application according to the acquired original video data and the acquired image data, for example, super-resolution reconstruction, image high-frequency fusion and the like are included, so that the finally processed video data has a frame rate and a resolution meeting the requirements of a user.
When a user wants to perform high-frame-rate, high-resolution video recording, the operation process of the terminal device may be as shown in fig. 6a, fig. 6b, fig. 6c, and fig. 6d. For example, as shown in fig. 6a, the terminal device displays a video recording setting interface 601, where the video recording setting interface may include a plurality of video resolution controls (for example, the (16:9) 1080P control 602 shown in fig. 6a) and a plurality of video frame rate controls. When the user wants to obtain a video with a high frame rate and a high resolution, the user can set the resolution and frame rate of the video captured by the camera1 for video recording to 1080P and 60fps by clicking the corresponding controls. Compared with fig. 4b and fig. 4c, in the embodiment of the present invention, the video recording assist function may be turned on through an input operation 605 (e.g., clicking) on the video recording assist control 604 shown in fig. 6a, so as to simultaneously call two cameras (e.g., including the camera1 for video recording and the camera2 for picture taking) to respectively acquire original video data and image data during video recording. As shown in fig. 6b, the terminal device receives an input operation 606 (e.g., a click) of the user for the setting control, and in response to the input operation 606, the terminal device displays an image capture setting interface 607. As shown in fig. 6b, the image capture setting interface 607 may include a plurality of image frame rate controls (e.g., automatic, 10fps, 5fps, 2fps, etc. as shown in fig. 6b) and a plurality of image resolution controls (e.g., automatic, 4K, 1080P, 720P, etc. as shown in fig. 6b). The embodiment of the present invention does not specifically limit the specific values and the number of the video frame rate, the video resolution, the image frame rate, and the image resolution. The user can set the frame rate and resolution of the image captured by the corresponding camera2 for photographing (for example, a frame rate of 20fps and a resolution of 4K as shown in fig. 6b) by clicking the corresponding image frame rate control and image resolution control. After finishing the setting, the user can record video; during recording, the camera1 may run in the foreground, and the original video data collected by the camera1 may provide a preview, displayed through the display of the terminal device.
For example, as shown in fig. 5, the raw video data captured by the camera1 may be a video frame including multiple frames with a high frame rate (e.g., 60 fps) and a low resolution (e.g., 1080P). And the image data collected by the camera2 may be a plurality of images with high resolution (e.g. 4K) and low frame rate (e.g. 20 fps). As shown in fig. 5, although the original video data acquired by the camera1 has a higher frame rate and smoother pictures, the resolution is too low, and if the original video data is directly output without being processed, the generated video pictures have low definition and poor quality, which seriously affects the user's impression. The processor may process a video frame included in the original video data by using a video generation method in the present application according to the original video data and the image data acquired by the camera1, for example, including super-resolution reconstruction and image high-frequency fusion shown in fig. 5. Thus, the video data with high frame rate and high resolution as shown in fig. 5 is obtained, i.e. the finally generated video picture is smooth and clear and has richer details. Optionally, after the video recording auxiliary function is started, the terminal device may synchronously acquire original video data and image data during a video recording process, and perform synchronous real-time processing on the original video data, that is, when the video recording is finished, the terminal device directly generates a high-frame-rate and high-resolution video. In addition, optionally, the terminal device may also perform only the acquisition operation during the video recording process, but not perform the related video processing synchronously, and when the video recording is finished, the terminal device may store the original video data and the image data that are acquired synchronously during the video recording process. Then, the terminal device may respond to an input operation of a user for the relevant control, and process the original video data based on the image data to obtain processed video data. Furthermore, the processor can also store the unprocessed and processed video data into the memory, and control the display to display related video pictures according to the unprocessed and processed video data, so that a user can intuitively and timely master the picture fluency and definition of the video generated after the processing according to the video. For example, as shown in fig. 6c, the terminal device receives an input operation 611 (e.g., clicking) from the user with respect to the play control 609 to display a play interface 608, and the play interface 608 displays unprocessed video, which has low resolution and blurred picture, as shown in fig. 6 c. Further, as shown in fig. 6d, the terminal device receives an input operation 612 (e.g., clicking) by the user with respect to the play control 610 to display a play interface 613, where the play interface 613 displays the processed video, and as shown in fig. 6d, the resolution of the processed video is high, and the screen is clear. The user can also share, edit, collect, and delete the unprocessed and processed video through a sharing control, an editing control, a collecting control, a deleting control, and the like as in fig. 6c and fig. 6d, respectively. 
For example, if the user feels that the processed video is good and the video image is smooth and clear after watching the unprocessed video and the processed video, the user may select to delete the unprocessed video, which is not specifically limited in this application. As described above, the terminal device may be a camera, a smart phone, a tablet computer, or the like, which has the functions of video data acquisition, image data acquisition, video processing, display, and the like, which is not specifically limited in this application.
And in a second scene, the terminal equipment is connected with the computing equipment, the computing equipment processes the original video data and the image data which are acquired by the terminal equipment and sent to the computing equipment, and finally, the video data with high frame rate and high resolution is generated:
referring to fig. 7, fig. 7 is a schematic view of an application scenario of another video generation method according to an embodiment of the present invention, where the application scenario includes a terminal device (for example, a smart phone in fig. 7) and a computing device (for example, a desktop computer in fig. 7). The terminal device and the computing device can perform data transmission in a wireless communication mode such as Bluetooth, wi-Fi or a mobile network or a wired communication mode such as a data line. Wherein, terminal equipment can include 2 cameras, for example camera1 and camera2 shown in fig. 7, and camera1 and camera2 can convert the light source signal that catches into digital signal, accomplish the collection of original video data and image data respectively. For example, as shown in fig. 7, the raw video data collected by the camera1 may be a video frame including multiple frames with a high frame rate (for example, the highest frame rate that can be achieved by the terminal device, such as 60 fps) and a low resolution (for example, at the highest frame rate that can be achieved by the terminal device, the resolution that can be satisfied, such as 720P). The image data collected by the camera2 may be a plurality of images including a high resolution (for example, the highest resolution that can be achieved by the terminal device, such as 4K) and a low frame rate (for example, at the highest resolution that can be achieved by the terminal device, a satisfactory frame rate, such as 20fps, can be achieved). The terminal device may then send the acquired raw video data and image data to the computing device via the wireless/wired communication described above. The computer device processes the acquired original video data and image data by using a video generation method in the present application, which may include super-resolution reconstruction, high-frequency image fusion, and the like as shown in fig. 7, for example. Finally, video data meeting the user requirements, such as the video data with high frame rate and high resolution shown in fig. 7, is generated, and the video picture is smooth and clear. Further, the computing device may also store the processed video data locally in the computing device, and select to send the video data to the terminal device or other devices. Further, the terminal device may also select to send multiple copies of the original video data and the image data corresponding to each other to the computing device, where the multiple copies of the original video data and the image data corresponding to each other may be the original video data and the image data that are obtained by shooting in advance through the terminal device and stored locally in the terminal device. Secondly, the computing device may process the multiple sets of original video data simultaneously or sequentially by using one video generation method in the present application, and finally generate multiple sets of video data (for example, video data with high frame rate and high resolution) meeting the user requirements. As described above, the terminal device may be a camera, a smart phone, a tablet computer, or the like having the above functions; the computing device may be a smart phone, a smart wearable device, a tablet computer, a notebook computer, a desktop computer, and the like, which have the above functions, and this is not particularly limited in this application.
It is understood that the application scenarios and system architectures in fig. 5 and fig. 7 are only some exemplary implementations in the embodiments of the present invention, and the application scenarios in the embodiments of the present invention include, but are not limited to, the above application scenarios and system architectures. In some possible embodiments, more original video data and image data may be acquired simultaneously based on three cameras or more cameras to achieve processing of video data, so that a more smooth, clearer and higher-quality video is obtained, and other scenes and examples are not listed and described one by one.
Referring to fig. 8, fig. 8 is a flowchart illustrating a video generation method according to an embodiment of the present invention, where the method is applicable to the application scenario and the system architecture described in fig. 5 or fig. 7, and is specifically applicable to the terminal device 100 of fig. 2. The following description will be given taking the processor 110 whose execution main body is inside the terminal device 100 in fig. 2 as an example, with reference to fig. 8. The method may include the following steps S801 to S804.
Step S801: the method comprises the steps of obtaining first video data collected by a first camera in a first time period, wherein the first video data comprise a plurality of frames of first video frames.
Specifically, after receiving the video recording instruction, the terminal device may invoke a first camera to collect first video data in a first time period, and the processor acquires the first video data, where the first video data may include multiple frames of first video frames. The resolution of the first video frame of each of the multiple first video frames is a first resolution, where the first resolution may be, for example, 4K, 1080P, 720P, and so on, the first resolution may be a default value (for example, the maximum resolution that the terminal device can satisfy on the premise of a maximum video frame rate, such as 60fps, such as 1080P), the first resolution may also be set by the user through the terminal device, and so on. Optionally, the frame rate of the first video frame may be a first frame rate, where the first frame rate may be, for example, 60fps (that is, a frame rate of 60 video frames per second), 30fps (that is, a frame rate of 30 video frames per second), and the like, and optionally, the first frame rate may be a default value (for example, when the first resolution is 1080P, a maximum frame rate that the terminal device may meet, such as 60 fps), and the first frame rate may also be set by a user through the terminal device, and the like, which is not specifically limited in this embodiment of the present invention.
Step S802: and acquiring image data acquired by a second camera in the first time period, wherein the image data comprises one or more images.
Specifically, after receiving the video recording instruction, the terminal device may simultaneously call the second camera to acquire image data within the first time period, and the processor acquires the image data, where the image data may include one or more images. The resolution of each of the one or more images is a second resolution, the second resolution may be, for example, 4K, 1080P, 720P, and so on, the second resolution may be a default value (for example, the maximum resolution that the terminal device can satisfy at the maximum frame rate of photographing, for example, 20fps, for example, 4K), the second resolution may also be set by the user through the terminal device, and so on. Alternatively, the photographing frame rate corresponding to the image data may be a second frame rate, and the second frame rate may be, for example, 20fps (i.e., 20 images per second), 10fps (i.e., 10 images per second), 2fps (i.e., 2 images per second), and the like. Optionally, the second frame rate may be a default value (for example, a maximum frame rate that the terminal device can meet, such as 20 fps), the second frame rate may also be set by the user through the terminal device, and the like, which is not specifically limited in this embodiment of the present invention. Optionally, referring to fig. 9, fig. 9 is a schematic diagram of a dual camera according to an embodiment of the present invention. As shown in fig. 9, a first camera (camera 1) and a second camera (camera 2) in the terminal device may be connected through a connector, and after receiving a video recording instruction, the terminal device may simultaneously call the first camera and the second camera to respectively collect first video data and image data in the first time period, and so on, which is not described herein again.
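For ease of understanding, the following is a minimal Python sketch of the two capture streams described in steps S801 and S802; the class and field names are illustrative assumptions only and are not part of the claimed method.

```python
# A minimal, illustrative model of the two synchronized capture streams.
from dataclasses import dataclass
from typing import List, Tuple
import numpy as np

@dataclass
class CapturedFrame:
    timestamp_ms: float   # acquisition time within the first time period
    pixels: np.ndarray    # H x W x 3 array at the stream's resolution

@dataclass
class CaptureStream:
    frame_rate_fps: float        # e.g. 60 for the first camera, 20 for the second
    resolution: Tuple[int, int]  # (width, height), e.g. (1280, 720) or (3840, 2160)
    frames: List[CapturedFrame]

def simulate_capture(frame_rate_fps: float, resolution: Tuple[int, int], duration_s: float) -> CaptureStream:
    """Generate evenly spaced dummy frames for one camera over the first time period."""
    w, h = resolution
    n = int(round(frame_rate_fps * duration_s))
    frames = [CapturedFrame(i * 1000.0 / frame_rate_fps, np.zeros((h, w, 3), dtype=np.uint8))
              for i in range(n)]
    return CaptureStream(frame_rate_fps, resolution, frames)

# First video data: high frame rate, low resolution; image data: low frame rate, high resolution.
first_video_data = simulate_capture(60.0, (1280, 720), duration_s=2.0)   # 120 first video frames
image_data = simulate_capture(20.0, (3840, 2160), duration_s=2.0)        # 40 images
```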
Step S803: and adjusting the resolution of each first video frame in the multiple first video frames to obtain second video data, wherein the second video data comprises multiple second video frames.
Specifically, the processor in the terminal device adjusts the resolution of each of the plurality of frames of the first video frame according to the resolution (i.e., the second resolution) of each of the one or more images, so as to obtain corresponding second video data, where the second video data includes a plurality of frames of the second video frame, and optionally, the resolution of each of the plurality of frames of the second video frame may be the second resolution.
For example, the first resolution is 720P, the first frame rate is 60fps; the second resolution is 4K, the second frame rate is 20fps, the second resolution is greater than the first resolution, and the first frame rate is greater than the second frame rate. Namely, the first camera collects the first video data with high frame rate and low resolution, and the second camera collects the image data with low frame rate and high resolution. Optionally, to ensure the accuracy and reasonableness of the subsequent series of processing, the information contained in the first video data and the image data should be maximally close to each other, the visual field range of the first camera and the visual field range of the second camera should be as consistent as possible, and the aspect ratio of the first video frame of each frame and the aspect ratio of each image should be consistent. Optionally, the processor may perform super-resolution reconstruction on each first video frame in the plurality of first video frames based on a first model (e.g., a convolutional neural network model) obtained through pre-training, to obtain a second video frame corresponding to each first video frame, where the resolution of the second video frame may be a second resolution (e.g., 4K), so as to obtain the second video data with high definition. It will be appreciated that in some possible embodiments, the second video data may also be obtained by performing super-resolution reconstruction on the first video frames of each frame through a model or method other than a convolutional neural network. Furthermore, the second video data may also be obtained by a method other than super-resolution reconstruction, and this is not particularly limited by the embodiment of the present invention. It should be noted that the present invention aims to adjust the resolution of the first video frame of each frame based on the first video data by referring to the resolution (second resolution) of the image data acquired by the second camera, so as to obtain the second video data. Optionally, the second resolution may also be less than or equal to the first resolution, so that according to the second resolution, the second model obtained through pre-training may be used to perform resolution compression on the first data frame of each frame, and the like, so as to obtain low-definition, small-memory, lightweight second video data, and the like, which is not limited in this embodiment of the present invention.
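As a non-limiting illustration, step S803 can be viewed as a frame-by-frame loop in which each first video frame is passed through the pre-trained first model. In the sketch below, first_model is assumed to be any callable that maps a frame tensor to its super-resolved counterpart (for example, a network trained as described in steps s11 to s12 below); PyTorch is assumed for illustration.

```python
# Illustrative sketch of step S803 (`first_model` is a stand-in for the pre-trained first model).
import torch

def build_second_video(first_video_frames, first_model):
    second_video_frames = []
    with torch.no_grad():                 # inference only, no gradients needed
        for frame in first_video_frames:  # frame: tensor of shape (1, 3, H, W)
            second_video_frames.append(first_model(frame))
    return second_video_frames            # the second video data keeps the first frame rate
```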
Optionally, the method for training to obtain the first model may specifically include the following steps s11 to s12:
step s11: training samples are obtained.
Specifically, a training sample is obtained, which may include a set of original images and a set of target images. The original image set may include N original images, and a resolution of each of the N original images may be the first resolution. The target image set may include N target images, and a resolution of each of the N target images may be the second resolution. The N original images correspond to the N target images one to one, and the second resolution may be greater than the first resolution, that is, the target image is a high-definition version of the original image corresponding to the target image, and image contents included in the target image and the original image corresponding to the target image are consistent. Alternatively, the N original images and the N target images may be images captured by a first camera and a second camera of the terminal device within the same time period and the same capturing field of view at a certain capturing frame rate (for example, 20 fps), respectively. Optionally, the N original images and the N target images may also be obtained by shooting through other devices and stored in other devices locally or in a cloud server, and the like, and the terminal device may obtain the N original images and the N target images through a wired network or a wireless network. Wherein N is an integer greater than or equal to 1.
Step s12: and training to obtain a first model by taking the N original images and the N target images as training input and taking the target images corresponding to the N original images as N labels.
Specifically, the terminal device takes the N original images and the N target images as training inputs, optimizes various parameters and the like in an initial model through supervised or weakly supervised deep learning, and thereby trains to obtain the first model, which can realize that an image with a first resolution (for example, a low-definition image) is input and a corresponding image with a second resolution (for example, a high-definition image) is output. Optionally, the training mode may be direct training, generative confrontation training, and the like, which is not specifically limited in this embodiment of the present invention.
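The following is a minimal supervised training sketch of steps s11 to s12, assuming PyTorch and assuming that the N original images have been pre-enlarged to the target size so that inputs and labels share the same spatial dimensions; the optimizer and loss are illustrative choices rather than requirements of the method.

```python
# Illustrative training loop: original images are the inputs, and the corresponding
# target (high-resolution) images serve as the N labels.
import torch
import torch.nn as nn

def train_first_model(model: nn.Module,
                      originals: torch.Tensor,   # shape (N, 3, H, W), pre-enlarged to the target size
                      targets: torch.Tensor,     # shape (N, 3, H, W), second-resolution labels
                      epochs: int = 10,
                      lr: float = 1e-4) -> nn.Module:
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = loss_fn(model(originals), targets)  # supervised: output vs. label
        loss.backward()
        optimizer.step()
    return model
```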
Optionally, the second resolution may also be less than or equal to the first resolution, and the step of training to obtain the second model may refer to the above step s11 to step s12. The second model may enable an image of a first resolution (e.g., a high-definition image) to be input and a corresponding image of a second resolution (e.g., a low-definition image) to be output.
Optionally, the first model may be a convolutional neural network model, and performing super-resolution reconstruction on the first video frame of each frame based on the convolutional neural network model to obtain the second video data, which may specifically include the following steps s21 to s24:
step s21: and amplifying each first video frame in the plurality of first video frames to a target size.
Specifically, the processor enlarges each of the plurality of first video frames to a target size, which may be determined by the second resolution. Optionally, in some possible embodiments, each of the plurality of first video frames may also be scaled down to its corresponding target size.
Step s22: and performing image feature extraction on the first video frame of each frame amplified to the target size, and performing a first convolution operation.
Specifically, the processor performs image feature extraction on each frame of the first video frame amplified to the target size to obtain at least one image block corresponding to each frame of the first video frame, performs a first convolution operation on each image block in the at least one image block to obtain a feature vector corresponding to each image block, and the feature vectors corresponding to all the image blocks can form a feature matrix, so as to obtain a first feature matrix corresponding to each frame of the first video frame.
Step s23: a second convolution operation is performed.
Specifically, through a second convolution operation, the first feature matrices corresponding to the first video frames of each frame are subjected to nonlinear mapping, so as to obtain second feature matrices corresponding to the first video frames of each frame.
Step s24: a third convolution operation is performed.
And restoring the second feature matrix corresponding to each first video frame through a third convolution operation to obtain a second video frame corresponding to each first video frame, so as to obtain second video data, wherein the second video data may include the second video frame corresponding to each first video frame. Alternatively, the third convolution operation may be a process of deconvolution.
It should be noted that, in some possible embodiments, performing super-resolution reconstruction on the first video frame of each frame based on the convolutional neural network model may include more or fewer steps s21 to s24, or even different steps, which is not specifically limited in this embodiment of the present invention.
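For illustration only, steps s21 to s24 can be sketched with an SRCNN-style three-layer convolutional network; this particular architecture and its layer sizes are assumptions, since the embodiment only requires a convolutional neural network model.

```python
# Sketch of steps s21-s24: enlarge the frame (s21), then apply three convolution
# operations for feature extraction (s22), non-linear mapping (s23) and restoration (s24).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrameSuperResolver(nn.Module):
    def __init__(self, scale: int = 2):
        super().__init__()
        self.scale = scale
        self.feature_extract = nn.Conv2d(3, 64, kernel_size=9, padding=4)  # s22: first convolution
        self.nonlinear_map = nn.Conv2d(64, 32, kernel_size=1)              # s23: second convolution
        self.reconstruct = nn.Conv2d(32, 3, kernel_size=5, padding=2)      # s24: third convolution

    def forward(self, first_video_frame: torch.Tensor) -> torch.Tensor:
        # s21: enlarge the first video frame to the target size determined by the second resolution
        x = F.interpolate(first_video_frame, scale_factor=self.scale,
                          mode="bicubic", align_corners=False)
        x = F.relu(self.feature_extract(x))  # first feature matrix
        x = F.relu(self.nonlinear_map(x))    # second feature matrix
        return self.reconstruct(x)           # second video frame

# Example: a 720P first video frame mapped toward a higher resolution.
frame = torch.rand(1, 3, 720, 1280)
second_video_frame = FrameSuperResolver(scale=2)(frame)  # shape (1, 3, 1440, 2560)
```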
Step S804: and performing image fusion on one or more second video frames in the plurality of second video frames based on the image data to obtain third video data, wherein the third video data comprises a plurality of third video frames.
Specifically, the one or more images may be M images, the plurality of frames of first video frames may be N1 frames of first video frames, and the plurality of frames of second video frames may be N2 frames of second video frames. The specific values of M, N1, and N2 are not limited in this embodiment of the present invention; it can be understood that N1 and M depend on the first frame rate and the second frame rate, respectively. Specifically, based on the image data, the terminal device performs image fusion on the ith frame of second video frame among the N2 frames of second video frames with reference to the jth image of the M images, and obtains a third video frame corresponding to the ith frame of second video frame, thereby obtaining the third video data. The value of i is an integer greater than or equal to 1 and less than or equal to N2, and the value of j is an integer greater than or equal to 1 and less than or equal to M. Optionally, to ensure the accuracy of image fusion, among the M images, the time difference between the acquisition time corresponding to the jth image and the acquisition time corresponding to the ith frame of second video frame is the smallest; that is, the content information contained in the jth image and the ith frame of second video frame is the closest, and the two have the most consistent regions. The third video data may include a plurality of third video frames, and the plurality of third video frames may be N3 third video frames, where N3 is an integer greater than or equal to 1, and N1 is generally equal to N2 and equal to N3.
For example, the first resolution is 720P and the first frame rate is 60fps; the second resolution is 4K, the second frame rate is 20fps, and the recording duration is 2 seconds. In this case, there are 120 second video frames and 40 images in total. The acquisition time corresponding to the 1st second video frame and the 1st image may both be 0 seconds (s), the acquisition time of the 2nd second video frame may be 16.7 milliseconds (ms), the acquisition time of the 2nd image may be 50ms, the acquisition time of the 3rd second video frame may be 33.3ms, the acquisition time of the 3rd image may be 0.1s, and so on, which is not described herein again. When image fusion (for example, fusion of the high-frequency information of the images) is performed on the 1st frame of second video frame and the 2nd frame of second video frame, the 1st image may be referred to; when image fusion is performed on the 3rd, 4th, and 5th frames of second video frames, the 2nd image may be referred to, and so on, which is not described herein again.
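The pairing rule of this example can be checked with the short calculation below: each second video frame is matched to the image whose acquisition time is closest to its own, using the 60fps/20fps, 2-second figures given above.

```python
# Worked check of the frame-to-image pairing in the example above.
def closest_image_index(frame_time_ms, image_times_ms):
    return min(range(len(image_times_ms)), key=lambda j: abs(image_times_ms[j] - frame_time_ms))

frame_times = [i * 1000.0 / 60 for i in range(120)]  # 120 second video frames over 2 s
image_times = [j * 1000.0 / 20 for j in range(40)]   # 40 images over 2 s

pairs = [closest_image_index(t, image_times) + 1 for t in frame_times]
print(pairs[:5])  # [1, 1, 2, 2, 2] -> frames 1-2 refer to image 1, frames 3-5 refer to image 2
```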
Optionally, before image fusion, dynamic region detection may be performed on the ith frame of second video frame and the jth image, respectively, to determine a dynamic region and a static region in the ith frame of second video frame and the jth image. The dynamic region may be a region of a moving object, and specifically may be a region in which the ith frame of second video frame (or the jth image) differs from the second video frame preceding it (or the image preceding the jth image), and the like. The static region may be a substantially constant background region such as buildings, grass, trees, and the like. Referring to fig. 10a to 10c, these are schematic diagrams illustrating details of image high-frequency fusion according to an embodiment of the present invention. As shown in fig. 10a, the dynamic region of the ith frame of second video frame may be the region within the white dashed box shown in fig. 10a; the dynamic region may include a walking person as shown in fig. 10a, and the static region of the ith frame of second video frame may be the background region (including grass, trees, buildings, etc.) shown in fig. 10a. Optionally, since the ith frame of second video frame and the jth image do not necessarily come from the same time, the dynamic regions in the ith frame of second video frame and in the jth image are not necessarily consistent and may even differ greatly; because the content information of the dynamic region in the ith frame of second video frame better reflects the actual situation, the information of the dynamic region in the ith frame of second video frame can be retained. For the static region, however, the content of the ith frame of second video frame is substantially consistent with that of the jth image, so that the static region in the jth image can be referred to for image fusion of the static region of the ith frame of second video frame. Alternatively, if the image high-frequency information of the static region in the jth image is greater than the image high-frequency information of the static region in the ith frame of second video frame, the image high-frequency information of the static region in the ith frame of second video frame may be replaced with the image high-frequency information of the static region in the jth image. Optionally, the static region may be further subdivided into a plurality of sub-regions (for example, including the static region 1 shown in fig. 10b), and image fusion may be performed on the ith frame of second video frame according to the respective image high-frequency information of the plurality of sub-regions, so as to obtain a third video frame corresponding to the ith frame of second video frame and thus the third video data; in this way, the details of the video picture of the third video data are richer and clearer, and the video quality is further improved. For example, as shown in fig. 10c, the left diagram may be the static region 1 in the first video frame corresponding to the static region 1 in the ith frame of second video frame; obviously, the static region 1 in the first video frame has low resolution, blurred details, and a poor appearance. The right diagram in fig. 10c may be the static region 1 in the third video frame corresponding to the static region 1 in the ith frame of second video frame; obviously, through the super-resolution reconstruction, the high-frequency image fusion, and the like, the resolution of the static region in the third video frame is higher, and the details are rich and clear.
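The static-region fusion described above can be sketched as follows; here the image high-frequency information is approximated as the difference between an image and its Gaussian-blurred copy, which is an illustrative assumption rather than a definition required by the embodiment.

```python
# Illustrative static-region high-frequency fusion (single-channel arrays, e.g. luminance).
import numpy as np
from scipy.ndimage import gaussian_filter

def high_frequency(img: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    return img - gaussian_filter(img, sigma=sigma)

def fuse_static_region(second_frame: np.ndarray,  # ith frame of second video frame, shape (H, W), float
                       ref_image: np.ndarray,     # jth image, same shape
                       static_mask: np.ndarray    # boolean mask marking the static region
                       ) -> np.ndarray:
    hf_frame = high_frequency(second_frame)
    hf_image = high_frequency(ref_image)
    fused = second_frame.copy()
    # Replace the frame's high-frequency content only where the jth image carries more
    # high-frequency information inside the static region; the dynamic region is untouched.
    stronger = (np.abs(hf_image) > np.abs(hf_frame)) & static_mask
    fused[stronger] = (second_frame - hf_frame + hf_image)[stronger]
    return fused
```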
Referring to fig. 11a to 11b, fig. 11a to 11b are schematic diagrams illustrating the overall steps of a video generation method according to an embodiment of the present invention. As shown in fig. 11a, after the relevant camera application is started and a Video recording instruction is received, the terminal device simultaneously calls the first camera (camera 1) and the second camera (camera 2) to respectively acquire the first Video data Video1 (high frame rate, low resolution) and the Image data Image (low frame rate, high resolution). The first video data may be as shown in fig. 11b: the frame rate is high but the resolution is low, the picture is blurred and unclear, and the detail loss is large. The image data may also be as shown in fig. 11b: the frame rate is low but the resolution is high, the picture is clear, and the details are rich. During the video recording process, the first camera may run in the foreground, and the collected first video data may be used to provide a preview during recording; the second camera may run in the background and output the collected image data. Next, as shown in fig. 11a, the terminal device may use a Convolutional Neural Network (CNN) model trained in an offline state (offline) on a first image set (e.g., a low-definition image set captured by the first camera) and a second image set (e.g., a high-definition image set captured by the second camera) to perform super-resolution reconstruction (e.g., AI (Artificial Intelligence) super-resolution reconstruction) on the plurality of frames of first Video frames included in the first Video data, so as to obtain second Video data Video2. Then, referring to the image data, image fusion (for example, the high-frequency image fusion shown in fig. 11a) is performed on the multiple frames of second Video frames included in the second Video data, so as to obtain third Video data Video3; the third Video data may be as shown in fig. 11b, with a higher frame rate, a higher resolution, and a smooth and clear Video picture. As shown in fig. 11a, the video generation method may be applied to online video recording (on line): for example, during the video recording process, while the first video data and the image data are synchronously acquired, a series of processes are performed on the first video data acquired in real time based on the image data acquired in real time, so that when the video recording is finished, the third video data with high resolution and high frame rate is directly generated, and no later processing is needed.
In addition, the application also provides a video generation method based on a single camera, and the method can comprise the following steps: raw data is acquired by a single camera at a target frame rate (e.g., a maximum frame rate that can be met by the terminal device, such as 60fps, i.e., 60 images are acquired per second). Secondly, based on the original data, frames are output through two paths of algorithms respectively, wherein the first path of algorithm can be a simpler algorithm commonly used in the prior art, and a one hundred percent frame output rate under a target frame rate can be ensured through the first path of algorithm, for example, the target frame rate is 60fps, and then a multi-frame image with the frame rate of 60fps can be obtained through the first path of algorithm. It can be understood that, since the first path algorithm is a simpler algorithm, the frame output rate is ensured, but the resolution of each frame of image is lower. The second-path algorithm may be a more complex algorithm, and an image with a higher resolution may be obtained through the second-path algorithm, so that, for example, on the premise that the target frame rate is 60fps, the second-path algorithm may only ensure a frame rate of 20fps or 10fps, or even lower. Optionally, the second-path algorithm may obtain multiple frames of high-resolution images at a fixed frame rate (e.g., 20fps, 10fps, etc.), and the second-path algorithm may also obtain multiple frames of high-resolution images at a non-fixed frame rate based on the key frames in the original data. Then, according to the multi-frame high-frame-rate and low-resolution images (for example, 60fps, 720p) obtained by the first algorithm and the multi-frame low-frame-rate and high-resolution images (for example, 10fps, 4k) obtained by the second algorithm, a series of operations such as image fusion are performed, so that the multi-frame high-frame-rate and high-resolution images (for example, 60fps, 4k) are finally obtained, that is, the high-frame-rate and high-resolution video is obtained.
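A rough sketch of this single-camera, two-path variant is given below; both path functions are simple placeholders standing in for whatever lightweight and heavyweight algorithms an implementation would actually use, and the pairing of each low-resolution frame with the nearest high-resolution frame mirrors the dual-camera fusion described earlier.

```python
# Illustrative single-camera, two-path frame output (placeholder processing only).
import numpy as np

def fast_path_frame(raw: np.ndarray) -> np.ndarray:
    # First path: quick, low-resolution development of every raw frame (naive 2x downscale here).
    return raw[::2, ::2]

def slow_path_frame(raw: np.ndarray) -> np.ndarray:
    # Second path: placeholder for the heavier, high-resolution development.
    return raw.astype(np.float32)

def single_camera_two_path(raw_frames, stride: int = 6):
    low_res = [fast_path_frame(f) for f in raw_frames]                 # 100% frame output rate
    high_res = {i: slow_path_frame(raw_frames[i])
                for i in range(0, len(raw_frames), stride)}            # reduced frame rate
    pairs = []
    for i, lr in enumerate(low_res):
        j = min(high_res, key=lambda k: abs(k - i))                    # nearest high-resolution frame
        pairs.append((lr, high_res[j]))  # each pair is then fused as described earlier
    return pairs
```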
Referring to fig. 12, fig. 12 is a schematic structural diagram of a video generating apparatus according to an embodiment of the present invention, where the video generating apparatus 20 may include: a first acquisition unit 201, a second acquisition unit 202, an adjustment unit 203, and an image fusion unit 205, wherein,
a first obtaining unit 201, configured to obtain first video data collected by a first camera in a first time period, where the first video data includes multiple frames of first video frames, and a resolution of each frame of the multiple frames of first video frames is a first resolution;
a second obtaining unit 202, configured to obtain image data collected by a second camera in the first time period, where the image data includes one or more images, and a resolution of each of the one or more images is a second resolution;
an adjusting unit 203, configured to adjust a resolution of each of the first video frames in the multiple frames of first video frames to obtain second video data, where the second video data includes multiple frames of second video frames, and a resolution of each of the second video frames in the multiple frames of second video frames is the second resolution;
an image fusion unit 205, configured to perform image fusion on one or more second video frames of the multiple second video frames based on the image data, so as to obtain third video data, where the third video data includes multiple third video frames.
In one possible implementation, the second resolution is greater than the first resolution; the adjusting unit 203 is specifically configured to:
and performing super-resolution reconstruction on each frame of the plurality of frames of the first video frames based on a first model obtained by pre-training to obtain the second video data.
In one possible implementation, the apparatus 20 further includes: a third acquisition unit 206 and a training unit 207, wherein,
a third obtaining unit 206, configured to obtain training samples, where the training samples include an original image set and a target image set; the original image set comprises N original images, the resolution of each original image in the N original images is the first resolution, the target image set comprises N target images, the resolution of each target image in the N target images is the second resolution, the N original images and the N target images are in one-to-one correspondence, and N is an integer greater than or equal to 1;
the training unit 207 is configured to train to obtain the first model by using the N original images and the N target images as training inputs and using target images corresponding to the N original images as N labels.
In one possible implementation, the first model is a convolutional neural network model; the adjusting unit 203 is further specifically configured to:
and based on the convolutional neural network model, performing super-resolution reconstruction on each first video frame in the plurality of first video frames to obtain second video frames corresponding to the first video frames, wherein the second video data comprises the second video frames corresponding to the first video frames.
In one possible implementation, the apparatus 20 further includes:
a determining unit 204, configured to perform dynamic region detection on an ith frame of the multiple frames of second video frames and a jth image of the one or more images, respectively, and determine a dynamic region and a static region in the ith frame of second video frames and the jth image, where i is an integer greater than or equal to 1, and j is an integer greater than or equal to 1.
In a possible implementation manner, in the respective acquisition time corresponding to the one or more images, the acquisition time corresponding to the jth image is the acquisition time with the smallest time difference with the acquisition time corresponding to the ith frame of the second video frame.
In a possible implementation manner, the image fusion unit 205 is specifically configured to:
if the image high-frequency information of the static region in the jth image is greater than the image high-frequency information of the static region in the ith second video frame, replace the image high-frequency information of the static region in the ith second video frame with that of the static region in the jth image, so as to obtain the third video data, wherein the information of the dynamic region in the third video frame corresponding to the ith second video frame is the information of the dynamic region in the ith second video frame.
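As a hedged sketch of this fusion rule, Gaussian-blur subtraction stands in below for whatever high-frequency separation is actually used; the helper names, the SciPy filter, and the comparison metric are assumptions rather than the claimed implementation.

```python
# Illustrative sketch of the fusion step: if the image carries more high-frequency
# detail in the static region, graft that detail onto the second video frame,
# while the dynamic region keeps the information of the i-th second video frame.
import numpy as np
from scipy.ndimage import gaussian_filter

def fuse_static_region(second_frame_i, image_j, static_mask, sigma=2.0):
    """static_mask: HxW boolean array, True inside the static region."""
    frame = second_frame_i.astype(np.float32)
    image = image_j.astype(np.float32)

    # Low-frequency component via Gaussian smoothing; high frequency is the residual.
    hf_frame = frame - gaussian_filter(frame, sigma=(sigma, sigma, 0))
    hf_image = image - gaussian_filter(image, sigma=(sigma, sigma, 0))

    # Compare the amount of high-frequency information inside the static region.
    if np.abs(hf_image[static_mask]).sum() > np.abs(hf_frame[static_mask]).sum():
        fused = frame.copy()
        # Replace only the high-frequency part of the static region.
        fused[static_mask] = (frame - hf_frame + hf_image)[static_mask]
        return np.clip(fused, 0, 255).astype(np.uint8)
    return second_frame_i
```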
In one possible implementation, the second resolution is less than or equal to the first resolution; the adjusting unit 203 is specifically configured to:
perform resolution compression on each first video frame of the multiple frames of first video frames based on a second model obtained by pre-training, so as to obtain the second video data.
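For the lower-resolution case, a sketch is given under the assumption that plain bicubic downsampling stands in for the pre-trained second model, which this passage leaves unspecified; the function name and input convention are illustrative.

```python
# Illustrative sketch: compress a first video frame to the (lower or equal) second resolution.
import torch
import torch.nn.functional as F

def compress_frame(first_frame, second_resolution):
    """first_frame: HxWx3 uint8 NumPy array; second_resolution: (height, width)."""
    x = torch.from_numpy(first_frame).permute(2, 0, 1).float().unsqueeze(0)
    y = F.interpolate(x, size=second_resolution, mode="bicubic", align_corners=False)
    return y.clamp(0, 255).squeeze(0).permute(1, 2, 0).byte().numpy()
```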
In one possible implementation, an aspect ratio of each of the plurality of first video frames is consistent with an aspect ratio of each of the one or more images.
It should be noted that, for the functions of the relevant units in the video generating apparatus 20 described in the embodiment of the present invention, reference may be made to the relevant descriptions of the relevant method embodiments described in fig. 1 to fig. 11b, and details are not repeated here.
Each of the units in fig. 12 may be implemented in software, hardware, or a combination thereof. A unit implemented in hardware may include a logic circuit, an arithmetic circuit, an analog circuit, or the like. A unit implemented in software may include program instructions, which may be regarded as a software product that is stored in a memory and can be executed by a processor to perform the relevant functions; for details, refer to the foregoing description.
Based on the description of the method embodiment and the device embodiment, the embodiment of the invention also provides the terminal equipment. Referring to fig. 13, fig. 13 is a schematic structural diagram of a terminal device according to an embodiment of the present invention, where the terminal device at least includes a processor 301, a shooting module 302, a display 303, and a computer-readable storage medium 304. The shooting module 302 may include a first camera and a second camera, where the first camera and the second camera may be respectively used to collect video data and image data, and the display 303 may be used to display the video data processed by the video generation method provided in the embodiment of the present invention. The processor 301, the photographing module 302, the display 303, and the computer-readable storage medium 304 in the terminal device may be connected by a bus or other means.
The computer-readable storage medium 304 may be stored in a memory of the terminal device; the computer-readable storage medium 304 is configured to store a computer program comprising program instructions, and the processor 301 is configured to execute the program instructions stored in the computer-readable storage medium 304. The processor 301 (or central processing unit, CPU) is the computing core and control core of the terminal device, and is adapted to implement one or more instructions, and specifically, adapted to load and execute one or more instructions so as to implement a corresponding method flow or a corresponding function. In one embodiment, the processor 301 according to the embodiment of the present invention may be configured to perform a series of processes for video generation, including: acquiring first video data acquired by a first camera in a first time period, wherein the first video data comprises a plurality of frames of first video frames, and the resolution of each frame of the plurality of frames of first video frames is a first resolution; acquiring image data acquired by a second camera in the first time period, wherein the image data comprises one or more images, and the resolution of each image in the one or more images is a second resolution; adjusting the resolution of each first video frame in the plurality of first video frames to obtain second video data, wherein the second video data comprises a plurality of second video frames, and the resolution of each second video frame in the plurality of second video frames is the second resolution; and performing image fusion on one or more second video frames in the plurality of second video frames based on the image data to obtain third video data, wherein the third video data comprises a plurality of third video frames, and the like.
An embodiment of the present invention further provides a computer-readable storage medium (memory), which is a memory device in the terminal device and is used for storing programs and data. It is understood that the computer-readable storage medium herein may include a built-in storage medium in the terminal device, and may also include an extended storage medium supported by the terminal device. The computer-readable storage medium provides a storage space that stores an operating system of the terminal device. In addition, one or more instructions, which may be one or more computer programs (including program code), are stored in the storage space and are adapted to be loaded and executed by the processor 301. It should be noted that the computer-readable storage medium may be a high-speed RAM, or may be a non-volatile memory, such as at least one disk memory; optionally, it may also be at least one computer-readable storage medium located remotely from the aforementioned processor.
By implementing the embodiment of the present invention, first video data (for example, including multiple frames of first video frames) and image data (for example, including one or more images) can be separately acquired by two cameras in the same time period and for the same shooting scene. Then, according to the resolution of the images in the image data, the resolution of the multiple frames of video frames in the first video data can be adjusted to obtain second video data (for example, including multiple frames of second video frames), where the resolution of the multiple frames of second video frames in the second video data is consistent with the resolution of the images. Finally, based on the images, image fusion is performed on one or more second video frames of the multiple frames of second video frames to further improve the video quality and obtain third video data (for example, including multiple frames of third video frames). The embodiment of the present invention can be used in various daily video recording scenarios: the original video data and the image data synchronously acquired by a camera or a mobile phone are processed, so that video data with a high resolution and a high frame rate is obtained, the quality of the video recorded by a terminal device such as a camera or a mobile phone is improved, and the actual requirements of users are met. Optionally, in the present application, the resolution of the images may be greater than the resolution of the first video frames, or may be less than or equal to the resolution of the first video frames; that is, according to the acquired first video data and image data, either high-resolution, high-quality video data or low-resolution, lightweight video data that occupies little memory can be obtained, depending on the actual needs of the user. In addition, in some possible embodiments, video data and image data may be separately collected by three or more cameras, and the video data may be processed based on the image data, so as to improve the video quality.
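Tying the pieces above together, an end-to-end outline of the described flow might look as follows. All helper functions are the hypothetical ones sketched earlier, and this is an illustration of the described steps under those assumptions, not the claimed implementation.

```python
# Illustrative end-to-end sketch: adjust resolution, pick the nearest image,
# split dynamic/static regions, then fuse high-frequency detail of the static region.
def generate_video(first_video_frames, frame_times, images, image_times, sr_model):
    # Step 1: adjust every first video frame to the second resolution.
    second_video_frames = reconstruct_video(first_video_frames, sr_model)
    third_video_frames = []
    for i, second_frame in enumerate(second_video_frames):
        # Step 2: pick the image captured closest in time to this frame.
        j = select_nearest_image(frame_times[i], image_times)
        # Step 3: split into dynamic and static regions (image assumed at second resolution).
        dynamic_mask = detect_regions(second_frame, images[j])
        # Step 4: fuse high-frequency detail of the static region.
        third_video_frames.append(
            fuse_static_region(second_frame, images[j], ~dynamic_mask)
        )
    return third_video_frames
```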
An embodiment of the present invention further provides a computer-readable storage medium, where the computer-readable storage medium may store a program, and when the program is executed, the program includes some or all of the steps described in the above video generation method embodiment.
Embodiments of the present invention also provide a computer program, which includes instructions that, when executed by a computer, enable the computer to perform some or all of the steps of any of the video generation methods.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
It should be noted that, for simplicity of description, the above-mentioned embodiments of the method are described as a series of acts, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred and that the acts and modules referred to are not necessarily required in this application.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units is only one type of division of logical functions, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of some interfaces, devices or units, and may be an electric or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit may be implemented in the form of hardware, or may also be implemented in the form of a software functional unit.
The integrated unit may be stored in a computer-readable storage medium if it is implemented in the form of a software functional unit and sold or used as a separate product. Based on such an understanding, the technical solutions of the present application essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like, and may specifically be a processor in the computer device) to perform all or part of the steps of the methods described in the embodiments of the present application. The storage medium may include any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), or a random access memory (RAM).
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (21)

1. A method of video generation, comprising:
acquiring first video data acquired by a first camera in a first time period, wherein the first video data comprises a plurality of frames of first video frames, and the resolution of each frame of the plurality of frames of first video frames is a first resolution;
acquiring image data acquired by a second camera in the first time period, wherein the image data comprises one or more images, and the resolution of each image in the one or more images is a second resolution;
adjusting the resolution of each first video frame in the multiple first video frames to obtain second video data, wherein the second video data comprises multiple second video frames, and the resolution of each second video frame in the multiple second video frames is the second resolution;
based on the image data, carrying out image fusion on one or more second video frames in the plurality of second video frames to obtain third video data, wherein the third video data comprises a plurality of third video frames;
before the performing image fusion on one or more second video frames of the plurality of second video frames based on the image data to obtain third video data, the method includes:
respectively performing dynamic region detection on an ith frame of second video frames in the multiple frames and a jth image in the one or more images, and determining a dynamic region and a static region in the ith frame of second video frames and the jth image, wherein i is an integer greater than or equal to 1, and j is an integer greater than or equal to 1;
the image fusion of one or more second video frames of the plurality of second video frames based on the image data to obtain third video data includes:
and if the image high-frequency information of the static area in the jth image is greater than the image high-frequency information of the static area in the ith second video frame, replacing the image high-frequency information of the static area in the ith second video frame with the image high-frequency information of the static area in the jth image to obtain the third video data, wherein the information of the dynamic area in the third video frame corresponding to the ith second video frame is the information of the dynamic area in the ith second video frame.
2. The method of claim 1, wherein the second resolution is greater than the first resolution; the adjusting the resolution of each first video frame of the plurality of first video frames to obtain second video data includes:
and performing super-resolution reconstruction on each first video frame in the plurality of first video frames based on a first model obtained by pre-training to obtain the second video data.
3. The method of claim 2, further comprising:
acquiring a training sample, wherein the training sample comprises an original image set and a target image set; the original image set comprises N original images, the resolution of each original image in the N original images is the first resolution, the target image set comprises N target images, the resolution of each target image in the N target images is the second resolution, the N original images and the N target images are in one-to-one correspondence, and N is an integer greater than or equal to 1;
and training to obtain the first model by taking the N original images and the N target images as training input and taking the target images corresponding to the N original images as N labels.
4. The method of claim 2 or 3, wherein the first model is a convolutional neural network model; the super-resolution reconstruction of each frame of the multiple frames of first video frames based on the first model obtained by pre-training to obtain the second video data comprises:
and performing super-resolution reconstruction on each frame of first video frames in the multiple frames of first video frames based on the convolutional neural network model to obtain second video frames corresponding to the first video frames, wherein the second video data comprise the second video frames corresponding to the first video frames.
5. The method according to claim 1, wherein, among the acquisition times corresponding to the one or more images, the acquisition time corresponding to the jth image is the acquisition time with the smallest time difference with the acquisition time corresponding to the ith frame of the second video frame.
6. The method of any of claims 1, 2, 3 or 5, wherein the aspect ratio of each of the plurality of first video frames is consistent with the aspect ratio of each of the one or more images.
7. A video generation apparatus, comprising:
the first acquisition unit is used for acquiring first video data acquired by a first camera in a first time period, wherein the first video data comprises a plurality of frames of first video frames, and the resolution of each frame of the plurality of frames of first video frames is a first resolution;
the second acquisition unit is used for acquiring image data acquired by a second camera in the first time period, wherein the image data comprises one or more images, and the resolution of each image in the one or more images is a second resolution;
the adjusting unit is used for adjusting the resolution of each first video frame in the multiple first video frames to obtain second video data, wherein the second video data comprises multiple second video frames, and the resolution of each second video frame in the multiple second video frames is the second resolution;
the image fusion unit is used for carrying out image fusion on one or more second video frames in the plurality of second video frames based on the image data to obtain third video data, wherein the third video data comprises a plurality of third video frames;
the device further comprises:
a determining unit, configured to perform dynamic region detection on an ith frame of the multiple frames of second video frames and a jth image of the one or more images, respectively, and determine a dynamic region and a static region in the ith frame of second video frames and the jth image, where i is an integer greater than or equal to 1, and j is an integer greater than or equal to 1;
the image fusion unit is specifically configured to:
and if the image high-frequency information of the static area in the jth image is greater than the image high-frequency information of the static area in the ith second video frame, replacing the image high-frequency information of the static area in the ith second video frame with the image high-frequency information of the static area in the jth image to obtain the third video data, wherein the information of the dynamic area in the third video frame corresponding to the ith second video frame is the information of the dynamic area in the ith second video frame.
8. The apparatus of claim 7, wherein the second resolution is greater than the first resolution; the adjusting unit is specifically configured to:
and performing super-resolution reconstruction on each frame of the plurality of frames of the first video frames based on a first model obtained by pre-training to obtain the second video data.
9. The apparatus of claim 8, further comprising:
a third obtaining unit, configured to obtain training samples, where the training samples include an original image set and a target image set; the original image set comprises N original images, the resolution of each original image in the N original images is the first resolution, the target image set comprises N target images, the resolution of each target image in the N target images is the second resolution, the N original images correspond to the N target images in a one-to-one mode, and N is an integer greater than or equal to 1;
and the training unit is used for training to obtain the first model by taking the N original images and the N target images as training input and taking the target images corresponding to the N original images as N labels.
10. The apparatus of claim 8 or 9, wherein the first model is a convolutional neural network model; the adjusting unit is further specifically configured to:
and based on the convolutional neural network model, performing super-resolution reconstruction on each first video frame in the plurality of first video frames to obtain second video frames corresponding to the first video frames, wherein the second video data comprises the second video frames corresponding to the first video frames.
11. The apparatus according to claim 7, wherein, among the capturing moments corresponding to the one or more images, the capturing moment corresponding to the jth image is the capturing moment with the smallest time difference with the capturing moment corresponding to the ith frame of the second video frame.
12. The apparatus of any of claims 7, 8, 9 or 11, wherein the aspect ratio of each of the plurality of frames of the first video frame is consistent with the aspect ratio of each of the one or more images.
13. A terminal device comprising a processor, a first camera and a second camera coupled to the processor:
the first camera is used for collecting first video data in a first time period;
the second camera is used for collecting image data in the first time period;
the processor is configured to:
acquiring the first video data, wherein the first video data comprises a plurality of frames of first video frames, and the resolution of each frame of first video frames in the plurality of frames of first video frames is a first resolution;
acquiring the image data, wherein the image data comprises one or more images, and the resolution of each image in the one or more images is a second resolution;
adjusting the resolution of each first video frame in the multiple first video frames to obtain second video data, wherein the second video data comprises multiple second video frames, and the resolution of each second video frame in the multiple second video frames is the second resolution;
based on the image data, carrying out image fusion on one or more second video frames in the plurality of second video frames to obtain third video data, wherein the third video data comprises a plurality of third video frames;
before the performing, based on the image data, image fusion on one or more second video frames of the plurality of second video frames to obtain third video data, the processor is further configured to:
respectively performing dynamic region detection on an ith frame of the second video frames and a jth image of the one or more images, and determining a dynamic region and a static region in the ith frame of the second video frames and the jth image, wherein i is an integer greater than or equal to 1, and j is an integer greater than or equal to 1;
the processor is specifically configured to:
and if the image high-frequency information of the static area in the jth image is greater than the image high-frequency information of the static area in the ith second video frame, replacing the image high-frequency information of the static area in the ith second video frame with the image high-frequency information of the static area in the jth image to obtain the third video data, wherein the information of the dynamic area in the third video frame corresponding to the ith second video frame is the information of the dynamic area in the ith second video frame.
14. The terminal device according to claim 13, wherein the second resolution is greater than the first resolution; the processor is specifically configured to:
and performing super-resolution reconstruction on each first video frame in the plurality of first video frames based on a first model obtained by pre-training to obtain the second video data.
15. The terminal device of claim 14, wherein the second resolution is greater than the first resolution; the processor is further configured to:
acquiring a training sample, wherein the training sample comprises an original image set and a target image set; the original image set comprises N original images, the resolution of each original image in the N original images is the first resolution, the target image set comprises N target images, the resolution of each target image in the N target images is the second resolution, the N original images and the N target images are in one-to-one correspondence, and N is an integer greater than or equal to 1;
and training to obtain the first model by taking the N original images and the N target images as training input and taking the target images corresponding to the N original images as N labels.
16. A terminal device according to claim 14 or 15, wherein the first model is a convolutional neural network model; the processor is specifically configured to:
and performing super-resolution reconstruction on each frame of first video frames in the multiple frames of first video frames based on the convolutional neural network model to obtain second video frames corresponding to the first video frames, wherein the second video data comprise the second video frames corresponding to the first video frames.
17. The terminal device according to claim 13, wherein, in the respective acquisition time instants corresponding to the one or more images, the acquisition time instant corresponding to the jth image is an acquisition time instant having a smallest time difference with an acquisition time instant corresponding to the ith frame of the second video frame.
18. The terminal device of any one of claims 13, 14, 15 or 17, wherein the aspect ratio of each of the plurality of frames of the first video frame is consistent with the aspect ratio of each of the one or more images.
19. A terminal device, comprising: the device comprises a processor, a memory, a display screen, a first camera and a second camera;
the memory, the display screen, the first camera, and the second camera are coupled to the processor, the memory is configured to store computer program code, the computer program code includes computer instructions, and the processor calls the computer instructions to cause the terminal device to perform:
displaying a shooting interface on the display screen, wherein the shooting interface comprises a shooting control;
in response to a shooting instruction detected by the shooting control, calling the first camera and the second camera to respectively collect first video data and image data in a first time period, wherein the first video data comprises multiple frames of first video frames, the resolution of each frame of the multiple frames of first video frames is a first resolution, the image data comprises one or more images, and the resolution of each image of the one or more images is a second resolution;
adjusting the resolution of each first video frame in the multiple first video frames to obtain second video data, wherein the second video data comprises multiple second video frames, and the resolution of each second video frame in the multiple second video frames is the second resolution;
based on the image data, carrying out image fusion on one or more frames of second video frames in the multiple frames of second video frames to obtain third video data, wherein the third video data comprises multiple frames of third video frames;
displaying the third video data on the display screen;
before the performing image fusion on one or more second video frames in the multiple second video frames based on the image data to obtain third video data, the method includes:
respectively performing dynamic region detection on an ith frame of the second video frames and a jth image of the one or more images, and determining a dynamic region and a static region in the ith frame of the second video frames and the jth image, wherein i is an integer greater than or equal to 1, and j is an integer greater than or equal to 1;
the image fusion of one or more second video frames of the plurality of second video frames based on the image data to obtain third video data includes:
and if the image high-frequency information of the static area in the jth image is greater than the image high-frequency information of the static area in the ith second video frame, replacing the image high-frequency information of the static area in the ith second video frame with the image high-frequency information of the static area in the jth image to obtain the third video data, wherein the information of the dynamic area in the third video frame corresponding to the ith second video frame is the information of the dynamic area in the ith second video frame.
20. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method of any of the preceding claims 1 to 6.
21. A computer program, characterized in that the computer program comprises instructions which, when executed by a computer, cause the computer to carry out the method according to any one of claims 1 to 6.
CN201911320245.1A 2019-12-19 2019-12-19 Video generation method and device and related equipment Active CN113099146B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911320245.1A CN113099146B (en) 2019-12-19 2019-12-19 Video generation method and device and related equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911320245.1A CN113099146B (en) 2019-12-19 2019-12-19 Video generation method and device and related equipment

Publications (2)

Publication Number Publication Date
CN113099146A CN113099146A (en) 2021-07-09
CN113099146B true CN113099146B (en) 2022-12-06

Family

ID=76662743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911320245.1A Active CN113099146B (en) 2019-12-19 2019-12-19 Video generation method and device and related equipment

Country Status (1)

Country Link
CN (1) CN113099146B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114138966B (en) * 2021-11-30 2023-05-23 四川大学 Weak supervised learning-based network threat information text key information extraction method
CN117082295B (en) * 2023-09-21 2024-03-08 荣耀终端有限公司 Image stream processing method, device and storage medium
CN117676305B (en) * 2024-01-31 2024-04-12 中亿(深圳)信息科技有限公司 Intelligent camera control method and system based on Internet of things

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7447382B2 (en) * 2004-06-30 2008-11-04 Intel Corporation Computing a higher resolution image from multiple lower resolution images using model-based, robust Bayesian estimation
KR101615479B1 (en) * 2014-09-30 2016-04-25 한국전자통신연구원 Method and apparatus for processing super resolution image using adaptive pre/post-filtering
CN106228512A (en) * 2016-07-19 2016-12-14 北京工业大学 Based on learning rate adaptive convolutional neural networks image super-resolution rebuilding method
CN106600536B (en) * 2016-12-14 2020-02-14 同观科技(深圳)有限公司 Video image super-resolution reconstruction method and device
CN107105204A (en) * 2017-05-19 2017-08-29 王朋华 A kind of method and apparatus for realizing real-time outdoor scene video desktop background

Also Published As

Publication number Publication date
CN113099146A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
WO2021244295A1 (en) Method and device for recording video
CN113099146B (en) Video generation method and device and related equipment
US11949978B2 (en) Image content removal method and related apparatus
US20230043815A1 (en) Image Processing Method and Electronic Device
JP2023511581A (en) Long focus imaging method and electronic device
WO2022105445A1 (en) Browser-based application screen projection method and related apparatus
WO2022007862A1 (en) Image processing method, system, electronic device and computer readable storage medium
CN113705665B (en) Training method of image transformation network model and electronic equipment
CN115689963B (en) Image processing method and electronic equipment
CN113536866A (en) Character tracking display method and electronic equipment
WO2023093169A1 (en) Photographing method and electronic device
EP4258632A1 (en) Video processing method and related device
CN113538227B (en) Image processing method based on semantic segmentation and related equipment
CN113918766B (en) Video thumbnail display method, apparatus and storage medium
WO2024055797A1 (en) Method for capturing images in video, and electronic device
EP4325877A1 (en) Photographing method and related device
WO2023160170A1 (en) Photographing method and electronic device
WO2023035921A1 (en) Method for image snapshot in video recording, and electronic device
WO2022062985A1 (en) Method and apparatus for adding special effect in video, and terminal device
WO2021190097A1 (en) Image processing method and device
CN115802148A (en) Method for acquiring image and electronic equipment
WO2023231696A1 (en) Photographing method and related device
WO2023035868A1 (en) Photographing method and electronic device
CN114915722B (en) Method and device for processing video
CN116193243B (en) Shooting method and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant