CN112200739A - Video processing method and device, readable storage medium and electronic equipment


Info

Publication number
CN112200739A
CN112200739A (application CN202011062278.3A)
Authority
CN
China
Prior art keywords
image
sequence
target
target image
determining
Prior art date
Legal status
Pending
Application number
CN202011062278.3A
Other languages
Chinese (zh)
Inventor
李旭 (Li Xu)
李梦醒 (Li Mengxing)
Current Assignee
Beijing Dami Technology Co Ltd
Original Assignee
Beijing Dami Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dami Technology Co Ltd
Priority to CN202011062278.3A
Publication of CN112200739A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/70 Denoising; Smoothing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T 3/4038 Image mosaicing, e.g. composing plane images from plane sub-images

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention discloses a video processing method and apparatus, a readable storage medium and an electronic device. In the embodiment, at least two image sequences are acquired; a transition image set is determined according to the at least two image sequences, the transition image set comprising a plurality of transition images, each transition image being an image located at a predetermined position in the beginning part or the last part of an image sequence; a target image is determined according to the transition image set; and the at least two image sequences are spliced based on the target image to obtain a target image sequence. In this way, any two image sequences can be spliced through the target image without first determining the clip order between them, realizing non-sequential splicing of multiple videos and reducing stutter when the videos are played in alternation.

Description

Video processing method and device, readable storage medium and electronic equipment
Technical Field
The invention relates to the field of video processing, in particular to a video processing method, a video processing device, a readable storage medium and electronic equipment.
Background
With the development of Internet applications, online teaching, intelligent customer service and the like are used ever more widely in daily life. For different application scenarios, multiple videos sometimes need to be recorded in advance and then spliced into one long video. For example, in a recorded-broadcast scenario of online teaching, multiple recorded clips need to be spliced into a complete teaching video.
Video stitching in the prior art is insufficiently smooth and can produce a sense of stutter: direct splicing causes momentary jumps, flicker and similar problems between the first video and the second video, giving the user a poor experience. Moreover, in the prior art the video splicing order is fixed and cannot adapt to dynamic interaction. For example, a lesson in online teaching contains a dynamic interaction stage in which students may raise questions in no fixed order, so how to splice multiple videos pre-recorded by a teacher so as to achieve unimpeded communication with students during dynamic interaction is a problem to be solved at present.
Disclosure of Invention
In view of this, embodiments of the present invention provide a video processing method and apparatus, a readable storage medium, and an electronic device, which realize non-sequential splicing of multiple videos and reduce stutter when the videos are played in alternation.
In a first aspect, an embodiment of the present invention provides a method for video processing, where the method includes:
acquiring at least two image sequences;
determining a transition image set according to the at least two image sequences, wherein the transition image set comprises a plurality of transition images, and each transition image is an image located at a predetermined position in the beginning part or the last part of an image sequence;
determining a target image according to the transition image set;
and splicing the at least two image sequences based on the target image to obtain a target image sequence.
In one embodiment, the determining a target image according to the transition image set specifically includes:
and determining, as the target image, the image having the smallest difference from the other images in the transition image set.
In an embodiment, the predetermined position is a start position or an end position of the image sequence.
In one embodiment, the at least two image sequences comprise a first image sequence and a second image sequence, and the stitching order places the first image sequence before the second image sequence;
the splicing processing of the at least two image sequences based on the target image to obtain a target image sequence specifically includes:
replacing an image corresponding to the ending position of the first image sequence and an image corresponding to the starting position of the second image sequence with the target image;
and splicing the replaced first image sequence and the replaced second image sequence into a target image sequence.
In one embodiment, the method further comprises:
smoothing the last part, of a set duration, of the replaced first image sequence;
and smoothing the beginning part, of the set duration, of the replaced second image sequence.
In an embodiment, smoothing the last part, of the set duration, of the replaced first image sequence specifically includes:
determining a time stamp of a first image, wherein the time stamp of the first image is earlier than that of the target image in the replaced first image sequence;
determining the similarity of the first image and the target image;
in response to the similarity meeting a preset similarity condition, determining a corresponding forward optical flow field vector and a reverse optical flow field vector according to the first image and the target image based on a pre-trained first model, wherein the forward optical flow field vector is used for representing forward optical flows of the first image and the target image, and the reverse optical flow field vector is used for representing reverse optical flows of the first image and the target image;
and determining at least one intermediate image according to the forward optical flow field vector and the reverse optical flow field vector based on a pre-trained second model, wherein the intermediate image is an image between the first image and the target image.
In an embodiment, the determining the similarity between the first image and the target image specifically includes:
calculating an optical flow similarity of the first image and the target image as the similarity.
In one embodiment, the determining at least one intermediate image according to the forward optical flow field vector and the reverse optical flow field vector based on a pre-trained second model specifically includes:
determining a first approximate optical flow field vector and a second approximate optical flow field vector according to the forward optical flow and the backward optical flow, wherein the first approximate optical flow field vector is used for representing the forward approximate optical flow at the intermediate moment, and the second approximate optical flow field vector is used for representing the backward approximate optical flow at the intermediate moment;
determining a first interpolation function according to the first image and the first approximate optical flow field vector;
determining a second interpolation function according to the target image and the second approximate optical flow field vector;
determining a forward visibility map, a backward visibility map, a first increment and a second increment for the intermediate time, with the first image, the target image, the first approximate optical flow field vector, the second approximate optical flow field vector, the first interpolation function and the second interpolation function as inputs of the second model, wherein the first increment characterizes the increment of the first approximate optical flow field vector at the intermediate time, and the second increment characterizes the increment of the second approximate optical flow field vector at the intermediate time;
and determining the intermediate image corresponding to the intermediate time according to the first image, the target image, the forward visibility map, the backward visibility map, the first increment and the second increment.
In one embodiment, the first model and the second model are trained by:
acquiring a plurality of image groups, wherein each image group comprises a first image sample, a target image sample and at least one intermediate image sample, and the first image sample, the target image sample and the intermediate image sample are different images in the same image sequence;
and taking the first image sample and the target image sample in each image group as input, taking the corresponding intermediate image sample as a training target, and simultaneously training the first model and the second model until the loss functions of the first model and the second model are converged.
In one embodiment, the loss function is used to characterize reconstruction loss, semantic loss, warping loss, and smoothing loss of the first and second models.
In one embodiment, the method further comprises:
and determining a target video according to the target image sequence and the corresponding audio sequence.
In a second aspect, an embodiment of the present invention provides an apparatus for video processing, where the apparatus includes:
an acquisition unit for acquiring at least two image sequences;
a determining unit, configured to determine a transition image set according to the at least two image sequences, wherein the transition image set includes a plurality of transition images, and each transition image is an image located at a predetermined position in the beginning part or the last part of an image sequence;
the determining unit is further configured to determine a target image according to the transition image set;
and the processing unit is used for splicing the at least two image sequences based on the target image to obtain a target image sequence.
In one embodiment, the determining unit is specifically configured to:
and determining, as the target image, the image having the smallest difference from the other images in the transition image set.
In an embodiment, the predetermined position is a start position or an end position of the image sequence.
In one embodiment, the at least two image sequences comprise a first image sequence and a second image sequence, and the stitching order places the first image sequence before the second image sequence;
the processing unit is specifically configured to:
replacing an image corresponding to the ending position of the first image sequence and an image corresponding to the starting position of the second image sequence with the target image;
and splicing the replaced first image sequence and the replaced second image sequence into a target image sequence.
In an embodiment, the processing unit is further configured to smooth the last part, of a set duration, of the replaced first image sequence;
and to smooth the beginning part, of the set duration, of the replaced second image sequence.
In one embodiment, the processing unit is specifically configured to:
determining a time stamp of a first image, wherein the time stamp of the first image is earlier than that of the target image in the replaced first image sequence;
determining the similarity of the first image and the target image;
in response to the similarity meeting a preset similarity condition, determining a corresponding forward optical flow field vector and a reverse optical flow field vector according to the first image and the target image based on a pre-trained first model, wherein the forward optical flow field vector is used for representing forward optical flows of the first image and the target image, and the reverse optical flow field vector is used for representing reverse optical flows of the first image and the target image;
and determining at least one intermediate image according to the forward optical flow field vector and the reverse optical flow field vector based on a pre-trained second model, wherein the intermediate image is an image between the first image and the target image.
In one embodiment, the processing unit is specifically configured to:
calculating an optical flow similarity of the first image and the target image as the similarity.
In one embodiment, the determining unit is further configured to:
and determining a target video according to the target image sequence and the corresponding audio sequence.
In a third aspect, an embodiment of the present invention provides a computer-readable storage medium on which computer program instructions are stored, which when executed by a processor implement the method according to the first aspect or any one of the possibilities of the first aspect.
In a fourth aspect, an embodiment of the present invention provides an electronic device, including a memory and a processor, the memory being configured to store one or more computer program instructions, wherein the one or more computer program instructions are executed by the processor to implement the method according to the first aspect or any one of the possibilities of the first aspect.
The embodiment of the invention acquires at least two image sequences; determines a transition image set according to the at least two image sequences, the transition image set comprising a plurality of transition images, each transition image being an image located at a predetermined position in the beginning part or the last part of an image sequence; determines a target image according to the transition image set; and splices the at least two image sequences based on the target image to obtain a target image sequence. In this way, any two image sequences are spliced through the target image without determining the clip order between them, realizing non-sequential splicing of multiple videos and reducing stutter when the videos are played in alternation.
Drawings
The above and other objects, features and advantages of the present invention will become more apparent from the following description of the embodiments of the present invention with reference to the accompanying drawings, in which:
FIG. 1 is a flow diagram of a method of video processing according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a structure of a target image sequence according to an embodiment of the present invention;
FIG. 3 is a flow chart of a method of video processing according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a first image sequence structure according to an embodiment of the present invention;
FIG. 5 is a flow chart of a method of determining an intermediate image according to an embodiment of the present invention;
FIG. 6 is a schematic view of an optical flow field at an intermediate time of an embodiment of the present invention;
FIG. 7 is a schematic illustration of image interpolation according to an embodiment of the invention;
FIG. 8 is a schematic diagram of the structure of a first image sequence and a second image sequence according to an embodiment of the present invention;
FIG. 9 is a flow chart of training a first model and a second model according to an embodiment of the present invention;
FIG. 10 is a schematic diagram of an apparatus for video processing according to an embodiment of the invention;
fig. 11 is a schematic diagram of an electronic device of an embodiment of the invention.
Detailed Description
The present disclosure is described below based on examples, but the present disclosure is not limited to only these examples. In the following detailed description of the present disclosure, certain specific details are set forth. It will be apparent to those skilled in the art that the present disclosure may be practiced without these specific details. Well-known methods, procedures, components and circuits have not been described in detail so as not to obscure the present disclosure.
Further, those of ordinary skill in the art will appreciate that the drawings provided herein are for illustrative purposes and are not necessarily drawn to scale.
Unless the context clearly requires otherwise, throughout this specification, the words "comprise", "comprising", and the like are to be construed in an inclusive sense as opposed to an exclusive or exhaustive sense; that is, what is meant is "including, but not limited to".
In the description of the present disclosure, it is to be understood that the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. In addition, in the description of the present disclosure, "a plurality" means two or more unless otherwise specified.
In the prior art, in the application scenario of online teaching, some course teachers need to record course content in advance, for example with a camera, a mobile phone with a photographing function, a tablet computer or another image acquisition device. Suppose a teacher records in advance the videos needed for a dynamically interactive classroom. Such a classroom is characterized by the fact that the order of the questions raised by students varies, so a separate short video must be recorded for each question, and the separately recorded short videos must then be spliced, in the order in which the students raise their questions, into a complete course video, realizing complete interaction between teacher and students. Because the order of the students' questions varies while the splicing order of videos in the prior art is fixed, the prior art cannot adapt to the dynamically interactive classroom. Furthermore, during playback of the complete course video the smoothness between video clips is low; that is, the last frame of one clip and the first frame of the adjacent next clip do not coincide well, so the image transition is rather abrupt when two adjacent clips are played one after the other, producing a jump or pause effect and a poor viewing experience. To solve this, existing stitching techniques insert image frames between short videos to achieve smoothing, and the frames to be inserted are generally determined from the preceding and following videos, so the splicing order remains fixed. For example, to splice three videos A, B and C, the order of A, B and C must be determined in advance; if the order is determined to be A, B, C, the image frames to be inserted between A and B and between B and C are determined from that order, so only fixed-order splicing can be achieved. In summary, how to splice multiple videos pre-recorded by a teacher so as to achieve unimpeded communication with students during dynamic interaction, while guaranteeing the user experience, is a problem to be solved at present.
In the embodiment of the invention, any two image sequences are spliced through the target image without determining the clip order between them, realizing non-sequential splicing of multiple videos and reducing stutter when the videos are played in alternation. Because splicing can be done in any order, a student can communicate with the teacher without obstruction even when the order in which questions are raised changes during online teaching: the short video (image sequence) corresponding to each question the student raises is determined, and the short videos are then spliced in the order in which the questions are raised. From the student's point of view, the teacher answers the questions fluently, improving the student's experience.
In the embodiment of the present invention, fig. 1 is a flowchart of a method for video processing according to the embodiment of the present invention. As shown in fig. 1, the method specifically comprises the following steps:
step S100, at least two image sequences are acquired.
In one or more embodiments, the image sequence may also be referred to as a video clip, a short video, a video, and the like, which is not limited by the embodiments of the present invention.
For example, suppose each image sequence is a short teaching video by a teacher, and each clip answers one student question. A student may raise several questions, so the teacher needs to record several teaching clips.
Step S101, determining a transition image set according to the at least two image sequences, wherein the transition image set comprises a plurality of transition images, and each transition image is an image located at a predetermined position in the beginning part or the last part of an image sequence.
In one or more embodiments, at least one transition image is obtained from each image sequence. A transition image may be the image at the start position of an image sequence or the image at its end position; it may also be an image in the beginning part of the sequence, for example one of the images within 0.5 seconds of its start, or an image in the last part, for example one of the images within 0.5 seconds of its end.
For example, assume there are 3 image sequences and one image is taken at the start position and at the end position of each. Let the sequences be image sequence 1, image sequence 2 and image sequence 3; the image at the start of sequence 1 is image A and the image at its end is image B; the image at the start of sequence 2 is image C and the image at its end is image D; the image at the start of sequence 3 is image E and the image at its end is image F. Images A, B, C, D, E and F then form the transition image set.
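As an illustrative sketch (not part of the original disclosure), the transition-image collection of step S101 could be written in Python with OpenCV. The 0.5-second margin follows the example above; the function and variable names are assumptions.

    import cv2

    def collect_transition_images(video_paths, margin_s=0.5):
        """Gather candidate transition images from the beginning and last
        parts of each video, per step S101."""
        transition_set = []
        for path in video_paths:
            cap = cv2.VideoCapture(path)
            fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
            margin = max(1, int(fps * margin_s))  # frames within 0.5 s of each end
            frames = []
            ok, frame = cap.read()
            while ok:
                frames.append(frame)
                ok, frame = cap.read()
            cap.release()
            transition_set.extend(frames[:margin])   # beginning part
            transition_set.extend(frames[-margin:])  # last part
        return transition_set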
And S102, determining a target image according to the transition image set.
In one or more embodiments, determining a target image according to the transition image set specifically includes: determining, as the target image, the image having the smallest difference from the other images in the transition image set.
In one or more embodiments, assume the transition image set contains 3 transition images, each showing the upper half of a person: in transition image 1 the person's head leans to the left, in transition image 2 it is centered, and in transition image 3 it leans to the right. Of the 3 transition images, transition image 2 therefore has the smallest difference from the other transition images, and transition image 2 is determined as the target image.
For example, following step S101, the transition image set consists of image A, image B, image C, image D, image E and image F, and image C is determined as the target image according to the minimum-difference principle.
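One plausible reading of the minimum-difference criterion is to score each candidate by its summed per-pixel distance to all other candidates and keep the argmin. A sketch follows; the mean-absolute-difference metric is an assumption, since the patent does not fix a particular difference measure, and all frames are assumed to share one resolution.

    import numpy as np

    def select_target_image(transition_set):
        """Return the transition image with the smallest total difference
        to the remaining images in the set (step S102)."""
        stack = np.stack([img.astype(np.float32) for img in transition_set])
        best_idx, best_cost = 0, float("inf")
        for i in range(len(stack)):
            # Sum of mean absolute differences to every other candidate.
            cost = sum(np.abs(stack[i] - stack[j]).mean()
                       for j in range(len(stack)) if j != i)
            if cost < best_cost:
                best_idx, best_cost = i, cost
        return transition_set[best_idx]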
And S103, splicing the at least two image sequences based on the target image to obtain a target image sequence.
In one or more embodiments, when the at least two image sequences are spliced based on the target image, the order of the image sequences is not limited. If there are two image sequences, a first image sequence and a second image sequence, the stitching order may place the first image sequence before the second, or the second before the first.
In one or more embodiments, when the first image sequence is in front and the second image sequence is in back, the splicing processing is performed on the at least two image sequences based on the target image to obtain a target image sequence, which specifically includes: replacing an image corresponding to the ending position of the first image sequence and an image corresponding to the starting position of the second image sequence with the target image; and splicing the replaced first image sequence and the replaced second image sequence into a target image sequence.
For example, step S102 determines image C as the target image. The stitching order of image sequence 1, image sequence 2 and image sequence 3 may be any of: sequence 1 to sequence 2 to sequence 3; sequence 2 to sequence 1 to sequence 3; sequence 3 to sequence 2 to sequence 1; sequence 3 to sequence 1 to sequence 2; and so on. Assume the stitching order is image sequence 2 to image sequence 1 to image sequence 3, as shown in fig. 2. Image D at the end position of sequence 2 is replaced by image C; image A at the start position and image B at the end position of sequence 1 are both replaced by image C; and image E at the start position of sequence 3 is replaced by image C. The replaced image sequences are then spliced to obtain the target image sequence.
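A minimal sketch of this replace-and-concatenate splicing, matching the example above (names are illustrative):

    def splice_sequences(sequences, target):
        """Replace every interior boundary frame with the target image,
        then concatenate the clips in the requested order (step S103)."""
        out = []
        for k, seq in enumerate(sequences):
            seq = list(seq)
            if k > 0:
                seq[0] = target              # start position of a later clip
            if k < len(sequences) - 1:
                seq[-1] = target             # end position of an earlier clip
            out.extend(seq)
        return out

    # E.g. splice_sequences([seq2, seq1, seq3], image_c) reproduces fig. 2:
    # images D, A, B and E are all replaced by the target image C.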
In the embodiment of the invention, in this way there is no need to consider the order of the image sequences: the target image is determined first, and the image sequences can then be spliced in any order based on it. This avoids the prior-art problems of having to fix the positions of the image sequences first and then determine the transition images between them from those positions, which prevents arbitrary splicing, requires many transition images to be determined, and makes the computation complex.
In the embodiment of the present invention, the above scheme realizes splicing in arbitrary order; but because the target image replaces the image at the start position or the end position of an image sequence, a pause or momentary jump appears when the beginning part or the last part of the sequence is played. The images in the beginning part or the last part of the image sequence therefore need to be smoothed.
In one or more embodiments, after the at least two image sequences are spliced based on the target image (the first image sequence before, the second image sequence after) to obtain a target image sequence, the method further includes: smoothing the last part, of a set duration, of the replaced first image sequence; and smoothing the beginning part, of the set duration, of the replaced second image sequence.
In one or more embodiments, smoothing the last part, of the set duration, of the replaced first image sequence specifically includes, as shown in fig. 3, the following steps:
step S300, determining a time stamp of a first image, wherein the time stamp of the first image is earlier than that of the target image in the replaced first image sequence.
For example, the timestamp of the first image is 0.5 seconds (s) before the timestamp of the target image in the first image sequence, as shown in fig. 4.
And S301, determining the similarity between the first image and the target image.
In one or more embodiments, determining the similarity between the first image and the target image specifically includes: calculating the optical flow similarity of the first image and the target image as the similarity. Optical flow is the projection of an object moving in (three-dimensional) space onto the pixel viewing plane (a two-dimensional image sequence); it is generated by the relative speeds of the object and the camera and reflects the direction and speed of motion of the object's image pixels over a very short period of time. In essence, the optical flow is a two-dimensional vector field, each vector representing the displacement of an image pixel in the scene from the previous frame to the next frame, in this embodiment the displacement from the first image to the target image. Specifically, a total variation algorithm with L1-norm regularization (TV-L1) may be selected to calculate the optical flow similarity between the first image and the target image. An implementation of TV-L1 is available as a function of OpenCV, the open-source, cross-platform computer vision and machine learning library; after the first image and the target image are converted into grayscale images, it outputs an optical flow field image in which each pixel value represents the displacement, in the x and y directions, of the corresponding image pixel as it moves from its position in the first image to its position in the target image. The optical flow similarity thus characterizes the displacement of the image pixels.
Optionally, other optical flow algorithms may also be used to determine the optical flow similarity between the first image and the target image, for example a total variation algorithm with L2-norm regularization (TV-L2) or the Horn-Schunck method; this embodiment is not particularly limited.
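A sketch of the TV-L1 similarity check, assuming the opencv-contrib-python build (which provides cv2.optflow). The scalar mapping from mean displacement to a similarity in (0, 1], and the threshold values, are illustrative choices rather than values fixed by the patent:

    import cv2
    import numpy as np

    def optical_flow_similarity(img_a, img_b):
        """TV-L1 optical flow between two frames, reduced to one scalar:
        larger displacement -> lower similarity."""
        gray_a = cv2.cvtColor(img_a, cv2.COLOR_BGR2GRAY)
        gray_b = cv2.cvtColor(img_b, cv2.COLOR_BGR2GRAY)
        tvl1 = cv2.optflow.createOptFlow_DualTVL1()
        flow = tvl1.calc(gray_a, gray_b, None)           # HxWx2 (dx, dy)
        mean_disp = np.linalg.norm(flow, axis=2).mean()  # mean pixel displacement
        return 1.0 / (1.0 + mean_disp)

    def needs_interpolation(sim, first_threshold=0.2, second_threshold=0.95):
        """Predetermined similarity condition: interpolate only when the
        similarity lies strictly between the two thresholds."""
        return first_threshold < sim < second_threshold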
Step S302, in response to that the similarity satisfies a predetermined similarity condition, determining, based on a pre-trained first model, a corresponding forward optical flow field vector and a reverse optical flow field vector according to the first image and the target image, where the forward optical flow field vector is used for representing forward optical flows of the first image and the target image, and the reverse optical flow field vector is used for representing reverse optical flows of the first image and the target image.
In one or more embodiments, since the optical flow similarity characterizes the displacement of image pixels, when the displacement is too large the optical flow similarity is low and the difference between the first image and the target image is large; when the displacement is too small the optical flow similarity is high and the difference is small. An intermediate image therefore needs to be inserted between the first image and the target image to realize a smooth transition, and the predetermined similarity condition is that the optical flow similarity is greater than a first threshold and less than a second threshold.
In one or more embodiments, the first model is used to calculate optical flow and may specifically be a Convolutional Neural Network (CNN). A CNN is a feedforward neural network that involves convolution computation and has a deep structure; it is one of the representative algorithms of deep learning. CNNs are modeled on the visual perception mechanism of living organisms and can perform supervised as well as unsupervised learning; thanks to the parameter sharing of convolution kernels in hidden layers and the sparsity of inter-layer connections, a CNN can learn grid-like topological features such as pixels and audio with a small amount of computation and stable effect, without additional feature-engineering requirements on the data. More specifically, the first model may be a Cascade Region Convolutional Neural Network (Cascade R-CNN).
Step S303, determining at least one intermediate image according to the forward optical flow field vector and the reverse optical flow field vector based on a pre-trained second model, where the intermediate image is an image between the first image and the target image.
In an embodiment, the determining at least one intermediate image according to the forward optical flow field vector and the reverse optical flow field vector based on the pre-trained second model, as shown in fig. 5, specifically includes the following steps:
s500, determining a first approximate optical flow field vector and a second approximate optical flow field vector according to the forward optical flow and the backward optical flow, wherein the first approximate optical flow field vector is used for representing the forward approximate optical flow at the intermediate moment, and the second approximate optical flow field vector is used for representing the backward approximate optical flow at the intermediate moment.
In one or more embodiments, the intermediate time is a time corresponding to an intermediate image between the first image and the target image. For example, denote the first image by I0 and the target image by I1. Taking an image acquisition period of 0.1 second, with the first image being the image 0.5 second from the end of the first image sequence and the target image being the image 0.1 second from the end, three intermediate images of 0.1 second each lie between them, so the intermediate time may take the three values 0.4, 0.3 and 0.2.
Because the intermediate image is unknown, the forward and reverse optical flow field vectors at the intermediate time are difficult to obtain directly. Fig. 6 is a schematic view of the optical flow field at an intermediate time. As shown in fig. 6, each dot represents a pixel; pixels in the same column correspond to the same time, and pixels in the same row correspond to the same position. For pixel 21 at time T (i.e., the intermediate time), its optical flow field vector can be approximated from the pixel at the same position at time T = 0 (i.e., the time of the first image), that is, the forward optical flow corresponding to pixel 22, and from the pixel at the same position at time T = 1 (i.e., the time of the target image), that is, the reverse optical flow corresponding to pixel 23. In particular, the first approximate optical flow field vector $\hat{F}_{t \to 0}$ may be determined from the forward optical flow corresponding to pixel 22, and the second approximate optical flow field vector $\hat{F}_{t \to 1}$ from the reverse optical flow corresponding to pixel 23. Specifically, the first approximate optical flow field vector can be determined by the following formula:

$\hat{F}_{t \to 0}(p) = -(1-t)\,t\,F_{0 \to 1}(p) + t^2\,F_{1 \to 0}(p)$

where $F_{0 \to 1}(p)$ characterizes the forward optical flow field vector, at time T = 0, of the pixel at the same position as pixel $p$ at the intermediate time, and $F_{1 \to 0}(p)$ characterizes the reverse optical flow field vector, at time T = 1, of the pixel at the same position as pixel $p$ at the intermediate time.

Similarly, the second approximate optical flow field vector can be determined by the following formula:

$\hat{F}_{t \to 1}(p) = (1-t)^2\,F_{0 \to 1}(p) - t\,(1-t)\,F_{1 \to 0}(p)$
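A numpy sketch of these two approximation formulas (array names and shapes are illustrative; f01 and f10 are the HxWx2 forward and reverse flow fields):

    import numpy as np

    def approximate_intermediate_flows(f01, f10, t):
        """Approximate flows at intermediate time t in (0, 1), per the
        formulas for F_t->0 and F_t->1 above."""
        f_t0 = -(1.0 - t) * t * f01 + (t * t) * f10
        f_t1 = (1.0 - t) ** 2 * f01 - t * (1.0 - t) * f10
        return f_t0, f_t1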
since the above approximation method works well in smooth regions, but not well near motion boundaries, artifacts may be generated because the motion near the motion boundaries is not locally smooth. Therefore, the server can determine a first interpolation function according to the first image and the first approximate optical flow field vector, and determine a second interpolation function according to the target image and the second approximate optical flow field vector.
S501, determining a first interpolation function according to the first image and the first approximate optical flow field vector.
S502, determining a second interpolation function according to the target image and the second approximate optical flow field vector.
In one or more embodiments, the first interpolation function and the second interpolation function are each bilinear interpolation functions.
In one or more embodiments, the server may determine a partial intermediate image based on the first image and the first approximate optical flow field vector, and based on the target image and the second approximate optical flow field vector, respectively, and then perform bilinear interpolation on the partial intermediate image using the first interpolation function and the second interpolation function, respectively, to obtain the missing elements of the intermediate image.
In one or more embodiments, FIG. 7 is a schematic illustration of image interpolation according to an embodiment of the invention. Fig. 7 takes the first interpolation function as an example; it is easily understood that the second interpolation function is determined in a similar manner. At time T = 0, pixel 71, pixel 72, pixel 73 and pixel 74 are four adjacent pixels, positioned as shown in the upper left of fig. 7; at time T the pixels have shifted and are no longer adjacent to each other, as shown in the upper right of fig. 7. When calculating a missing pixel among pixels 71 to 74 at time T, such as pixel 75, the server may take the direction in which pixel 71 points to pixel 72 as the positive x-axis and the direction in which pixel 71 points to pixel 73 as the positive y-axis, set the coordinates of pixels 71, 72, 73 and 74 at time T to (0,0), (1,0), (0,1) and (1,1) respectively, and then determine the pixel value of pixel 75 from its distances to pixels 71 to 74, namely x, 1-x, y and 1-y (where x and y are real numbers greater than 0 and less than 1). Specifically, the first interpolation function can be expressed by the following formula:

$f(x, y) \approx f(0,0)\,(1-x)(1-y) + f(1,0)\,x(1-y) + f(0,1)\,(1-x)\,y + f(1,1)\,x\,y$

where (x, y) is the coordinate of pixel 75, f(x, y) is the pixel value of pixel 75, f(0,0) is the pixel value of pixel 71, f(1,0) is the pixel value of pixel 72, f(0,1) is the pixel value of pixel 73, and f(1,1) is the pixel value of pixel 74.
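The formula translates directly into code; a sketch (f00 to f11 are the four corner pixel values):

    def bilinear(f00, f10, f01, f11, x, y):
        """Bilinearly interpolate the missing pixel at (x, y) in the unit
        square from its four neighbors, per the formula above."""
        return (f00 * (1 - x) * (1 - y) + f10 * x * (1 - y)
                + f01 * (1 - x) * y + f11 * x * y)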
And S503, determining a forward visibility map, a backward visibility map, a first increment and a second increment for the intermediate time, with the first image, the target image, the first approximate optical flow field vector, the second approximate optical flow field vector, the first interpolation function and the second interpolation function as inputs of the second model, wherein the first increment characterizes the increment of the first approximate optical flow field vector at the intermediate time, and the second increment characterizes the increment of the second approximate optical flow field vector at the intermediate time.
In one or more embodiments, after determining the first approximate optical flow field vector, the second approximate optical flow field vector, the first interpolation function and the second interpolation function, the server may take the first image, the first approximate optical flow field vector, the first interpolation function, the target image, the second approximate optical flow field vector and the second interpolation function as inputs and obtain, based on the second model, the forward visibility map, the backward visibility map, the first increment and the second increment for the intermediate time, so as to perform residual correction on the image to be corrected. The forward visibility map and the backward visibility map characterize the visibility of pixels of the target image, the first increment characterizes the increment of the first approximate optical flow field vector at the intermediate time, and the second increment characterizes the increment of the second approximate optical flow field vector at the intermediate time.
In one or more embodiments, the second model is used to correct the optical flow (i.e., the forward and reverse optical flow field vectors). Like the first model, it may be a CNN, specifically a Cascade R-CNN, and, similarly to the first model, the main architecture of the second model may also be U-Net.
S504, determining the intermediate image corresponding to the intermediate time according to the first image, the target image, the forward visibility map, the backward visibility map, the first increment and the second increment.
In one or more embodiments, assume three intermediate images are required between the first image and the target image, the timestamp of the first image being 0.5 s from the end and the timestamp of the target image being 0.1 s from the end. The corresponding intermediate images are then inserted at intermediate times 0.4 s, 0.3 s and 0.2 s respectively, so that the first image transitions smoothly to the target image.
Similarly, the smoothing the image sequence of the beginning portion with the set duration in the replaced second image sequence specifically includes: determining a time stamp of a second image, wherein the time stamp of the second image is later than that of the target image in the replaced second image sequence; determining the similarity of the second image and the target image; in response to the similarity meeting a preset similarity condition, determining a corresponding forward optical flow field vector and a reverse optical flow field vector according to the second image and the target image based on a pre-trained first model, wherein the forward optical flow field vector is used for representing forward optical flows of the second image and the target image, and the reverse optical flow field vector is used for representing reverse optical flows of the second image and the target image; and determining at least one intermediate image according to the forward optical flow field vector and the reverse optical flow field vector based on a pre-trained second model, wherein the intermediate image is an image between the second image and the target image.
In summary, after the replaced first image sequence and the replaced second image sequence are spliced, the last part, of the set duration, of the replaced first image sequence is smoothed, that is, the images in the last part of the first image sequence are replaced; the beginning part, of the set duration, of the replaced second image sequence is smoothed, that is, the images in the beginning part of the second image sequence are replaced; and the target image sequence is finally obtained.
For example, as shown in fig. 8, the last 0.5 s of images of the first image sequence are replaced by a combination of the smoothed intermediate images (which may also be called smooth frames) and the target image (which may also be called a standard frame); the first 0.5 s of images of the second image sequence are replaced by a combination of the target image (standard frame) and the smoothed intermediate images (smooth frames).
In one or more embodiments, after splicing of the target image sequence is completed, a target video is determined from the target image sequence and the corresponding audio sequences. Specifically, the audio sequences are ordered according to the order of the image sequences within the target image sequence and spliced into a target audio sequence; the target image sequence and the target audio sequence are then synthesized, and a video file is output.
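One way to perform this synthesis step is to hand the spliced frames and the target audio sequence to the ffmpeg command-line tool; a sketch, in which the frame naming pattern, frame rate and paths are all illustrative:

    import subprocess

    def write_target_video(frame_pattern, audio_path, out_path, fps=10):
        """Mux the target image sequence with the target audio sequence
        into a video file via ffmpeg."""
        subprocess.run([
            "ffmpeg", "-y",
            "-framerate", str(fps), "-i", frame_pattern,  # e.g. "frame_%04d.png"
            "-i", audio_path,
            "-c:v", "libx264", "-pix_fmt", "yuv420p",
            "-c:a", "aac", "-shortest",
            out_path,
        ], check=True)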
In one or more embodiments, the first model and the second model are trained in an unsupervised manner, and the server may train the first model and the second model simultaneously. FIG. 9 is a flow chart for training the first model and the second model. As shown in fig. 9, the first model and the second model may be trained as follows:
step S900, obtaining a plurality of image groups, wherein each image group comprises a first image sample, a target image sample and at least one intermediate image sample, and the first image sample, the target image sample and the intermediate image sample are different images in the same image sequence.
In one or more embodiments, an intermediate image sample is an image whose timestamp lies between those of the first image sample and the target image sample.
For example, suppose the image sequence used as a sample includes image P1, image P2, ..., image P(m-1) and image Pm (where m is a predetermined integer greater than 1), arranged in chronological order. The server may determine image P1 as the first image sample, image Pm as the target image sample, and at least one of images P2 to P(m-1) as intermediate image samples. After determining the intermediate image samples, the server may determine the time corresponding to each intermediate image sample according to their number. For example, if the number of intermediate image samples is 3, the server may determine the time corresponding to the earliest intermediate image sample on the time axis as 0.4 s and the time corresponding to the latest as 0.2 s.
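A sketch of building one training image group from a sample sequence; the even spacing of the intermediate samples is an assumption consistent with the 0.4 s / 0.2 s example above:

    def make_image_group(sequence, num_intermediate=3):
        """Split one sample image sequence P1..Pm into a first image sample,
        a target image sample and evenly spaced intermediate image samples."""
        if len(sequence) < num_intermediate + 2:
            raise ValueError("sequence too short for the requested group")
        first, target = sequence[0], sequence[-1]
        interior = sequence[1:-1]
        step = len(interior) / (num_intermediate + 1)
        intermediates = [interior[int(step * (k + 1))]
                         for k in range(num_intermediate)]
        return first, target, intermediates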
Step S901, taking the first image sample and the target image sample in each image group as input, taking the corresponding intermediate image sample as a training target, and simultaneously training the first model and the second model until the loss functions of the first model and the second model converge.
In one or more embodiments, in the training process of the first model and the second model, the server may train the two models simultaneously, with the first image sample and the target image sample in each image group as inputs and the corresponding intermediate image samples as training targets, until the loss functions of the first model and the second model converge. For example, image group G1 includes a first image P1, a target image P2, an intermediate image P31, an intermediate image P32 and an intermediate image P33; the server may train the first model and the second model simultaneously with the first image P1 and the target image P2 as inputs and the intermediate images P31, P32 and P33 as training targets.
In one or more embodiments, the loss function shared by the first model and the second model consists of four parts, which respectively characterize the reconstruction loss, semantic loss, warping loss and smoothing loss of the two models. The reconstruction loss characterizes the quality of intermediate-image reconstruction, with pixel values usually in the range [0, 255]; the semantic loss preserves the details of the prediction and enhances the sharpness of the target image; the warping loss measures the optical flow quality of the target image; and the smoothing loss encourages similar optical flow between adjacent pixels. Specifically, the loss function $l$ can be represented by the following formula:

$l = \lambda_r l_r + \lambda_p l_p + \lambda_w l_w + \lambda_s l_s$

where $\lambda_r$ is the weight of the reconstruction loss $l_r$, $\lambda_p$ the weight of the semantic loss $l_p$, $\lambda_w$ the weight of the warping loss $l_w$, and $\lambda_s$ the weight of the smoothing loss $l_s$. Optionally, $\lambda_r$ may be set to 0.8, $\lambda_p$ to 0.005, $\lambda_w$ to 0.4, and $\lambda_s$ to 1. The reconstruction loss $l_r$ can be expressed by the following formula:

$l_r = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \hat{I}_{t_i} - I_{t_i} \right\rVert_1$

where $i$ indexes the $i$-th intermediate image sample, $\hat{I}_{t_i}$ characterizes the intermediate image predicted at time $t_i$, and $I_{t_i}$ characterizes the actual intermediate image sample at time $t_i$.

The semantic loss $l_p$ can be expressed by the following formula:

$l_p = \frac{1}{N} \sum_{i=1}^{N} \left\lVert \phi(\hat{I}_{t_i}) - \phi(I_{t_i}) \right\rVert_2$

where $\phi$ characterizes the conv4_3 features of a VGG16 model pre-trained on the ImageNet dataset.
The warping loss $l_w$ can be expressed by the following formula:

$l_w = \lVert I_0 - g(I_1, F_{0 \to 1}) \rVert_1 + \lVert I_1 - g(I_0, F_{1 \to 0}) \rVert_1 + \frac{1}{N} \sum_{i=1}^{N} \left( \lVert I_{t_i} - g(I_0, \hat{F}_{t_i \to 0}) \rVert_1 + \lVert I_{t_i} - g(I_1, \hat{F}_{t_i \to 1}) \rVert_1 \right)$

where $g(\cdot, \cdot)$ denotes backward warping of an image by an optical flow field.

The smoothing loss $l_s$ can be expressed by the following formula:

$l_s = \lVert \nabla F_{0 \to 1} \rVert_1 + \lVert \nabla F_{1 \to 0} \rVert_1$
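A PyTorch sketch of the combined loss; the weighted sum and the weights follow the formulas above, while the two example term implementations are assumptions consistent with them:

    import torch

    def combined_loss(l_r, l_p, l_w, l_s,
                      w_r=0.8, w_p=0.005, w_w=0.4, w_s=1.0):
        """Weighted sum of reconstruction, semantic, warping and smoothing
        losses, per l = lr*0.8 + lp*0.005 + lw*0.4 + ls*1."""
        return w_r * l_r + w_p * l_p + w_w * l_w + w_s * l_s

    def reconstruction_loss(pred_frames, true_frames):
        """Mean L1 error over the N predicted intermediate images."""
        return torch.stack([(p - t).abs().mean()
                            for p, t in zip(pred_frames, true_frames)]).mean()

    def smoothness_loss(f01, f10):
        """L1 norm of the spatial gradients of both flow fields (2, H, W)."""
        def tv(f):
            return ((f[:, 1:, :] - f[:, :-1, :]).abs().mean()
                    + (f[:, :, 1:] - f[:, :, :-1]).abs().mean())
        return tv(f01) + tv(f10)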
therefore, after the loss function is converged, the first model and the second model can be considered to be trained completely, and an intermediate image with higher smoothness can be obtained according to the first image and the target image in the follow-up process.
Fig. 10 is a schematic diagram of an apparatus for video processing according to an embodiment of the present invention. As shown in fig. 10, the apparatus of the present embodiment includes an acquisition unit 1001, a determination unit 1002, and a processing unit 1003.
The apparatus comprises an acquisition unit, a determining unit and a processing unit. The acquisition unit is used for acquiring at least two image sequences; the determining unit is configured to determine a transition image set according to the at least two image sequences, wherein the transition image set includes a plurality of transition images, and each transition image is an image located at a predetermined position in the beginning part or the last part of an image sequence; the determining unit is further configured to determine a target image according to the transition image set; and the processing unit is used for splicing the at least two image sequences based on the target image to obtain a target image sequence.
Further, the determining unit is specifically configured to:
and determining, as the target image, the image having the smallest difference from the other images in the transition image set.
Further, the predetermined position is a start position or an end position of the image sequence.
Further, the at least two image sequences comprise a first image sequence and a second image sequence, and the stitching order places the first image sequence before the second image sequence;
the processing unit is specifically configured to: replacing an image corresponding to the ending position of the first image sequence and an image corresponding to the starting position of the second image sequence with the target image;
and splicing the replaced first image sequence and the replaced second image sequence into a target image sequence.
Further, the processing unit is further configured to smooth the last part, of a set duration, of the replaced first image sequence;
and to smooth the beginning part, of the set duration, of the replaced second image sequence.
Further, the processing unit is specifically configured to:
determining a time stamp of a first image, wherein the time stamp of the first image is earlier than that of the target image in the replaced first image sequence;
determining the similarity of the first image and the target image;
in response to the similarity meeting a preset similarity condition, determining a corresponding forward optical flow field vector and a reverse optical flow field vector according to the first image and the target image based on a pre-trained first model, wherein the forward optical flow field vector is used for representing forward optical flows of the first image and the target image, and the reverse optical flow field vector is used for representing reverse optical flows of the first image and the target image;
and determining at least one intermediate image according to the forward optical flow field vector and the reverse optical flow field vector based on a pre-trained second model, wherein the intermediate image is an image between the first image and the target image.
Further, the processing unit is specifically configured to:
calculating an optical flow similarity of the first image and the target image as the similarity.
Further, the determining unit is further configured to:
and determining a target video according to the target image sequence and the corresponding audio sequence.
Fig. 11 is a schematic diagram of an electronic device of an embodiment of the invention. The electronic device shown in fig. 11 is a general-purpose video processing apparatus with a general-purpose computer hardware structure comprising at least a processor 1101 and a memory 1102, connected by a bus 1103. The memory 1102 is adapted to store instructions or programs executable by the processor 1101. The processor 1101 may be a stand-alone microprocessor or a collection of one or more microprocessors; it implements the processing of data and the control of other devices by executing the instructions stored in the memory 1102, thereby performing the method flows of the embodiments of the invention described above. The bus 1103 connects these components together and also connects them to a display controller 1104, a display device, and input/output (I/O) devices 1105. The input/output (I/O) devices 1105 may be a mouse, keyboard, modem, network interface, touch input device, motion-sensing input device, printer, or other device known in the art. Typically, the input/output devices 1105 are connected to the system through an input/output (I/O) controller 1106.
As will be appreciated by one skilled in the art, aspects of the present invention may be embodied as a system, method or computer program product. Accordingly, various aspects of embodiments of the invention may take the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.), or an embodiment combining software and hardware aspects that may all generally be referred to herein as a "circuit," "module," or "system." Furthermore, various aspects of embodiments of the invention may take the form of: a computer program product embodied in one or more computer readable media having computer readable program code embodied thereon.
Any combination of one or more computer-readable media may be utilized. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the context of embodiments of the present invention, a computer readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
A computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated signal may take any of a variety of forms, including, but not limited to: electromagnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations for aspects of embodiments of the present invention may be written in any combination of one or more programming languages, including: object oriented programming languages such as Java, Smalltalk, C++, and the like; and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
Various aspects of embodiments of the invention are described above with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer readable medium that can direct a computer, other programmable data processing apparatus, or other devices to function in a particular manner, such that the instructions stored in the computer readable medium produce an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method of video processing, the method comprising:
acquiring at least two image sequences;
determining a transition image set according to the at least two image sequences, wherein the transition image set comprises a plurality of transition images, and the transition images are images located at a predetermined position in the beginning portion or the ending portion of the image sequences;
determining a target image according to the transition image set;
and splicing the at least two image sequences based on the target image to obtain a target image sequence.
2. The method according to claim 1, wherein the determining a target image from the set of transition images specifically comprises:
and determining, as the target image, an image in the transition image set having the minimum difference from the remaining images.
3. The method of claim 1, wherein the predetermined position is a start position or an end position of the image sequence.
4. The method of claim 3, wherein the at least two image sequences comprise a first image sequence and a second image sequence, and the stitching order places the first image sequence before the second image sequence;
the splicing processing of the at least two image sequences based on the target image to obtain a target image sequence specifically includes:
replacing the image at the end position of the first image sequence and the image at the start position of the second image sequence with the target image;
and splicing the replaced first image sequence and the replaced second image sequence into a target image sequence.
5. The method of claim 4, further comprising:
smoothing the image sub-sequence covering a set duration at the end of the replaced first image sequence;
and smoothing the image sub-sequence covering a set duration at the beginning of the replaced second image sequence.
6. The method according to claim 5, wherein the smoothing of the image sub-sequence covering the set duration at the end of the replaced first image sequence specifically comprises:
determining a first image, wherein the first image is an image in the replaced first image sequence whose time stamp is earlier than that of the target image;
determining the similarity of the first image and the target image;
in response to the similarity meeting a preset similarity condition, determining a corresponding forward optical flow field vector and a reverse optical flow field vector according to the first image and the target image based on a pre-trained first model, wherein the forward optical flow field vector represents the forward optical flow from the first image to the target image, and the reverse optical flow field vector represents the reverse optical flow from the target image to the first image;
and determining at least one intermediate image according to the forward optical flow field vector and the reverse optical flow field vector based on a pre-trained second model, wherein the intermediate image is an image between the first image and the target image.
7. The method according to claim 6, wherein the determining the similarity between the first image and the target image specifically comprises:
calculating an optical flow similarity between the first image and the target image as the similarity.
8. The method of claim 1, further comprising:
and determining a target video according to the target image sequence and the corresponding audio sequence.
9. An apparatus for video processing, the apparatus comprising:
an acquisition unit for acquiring at least two image sequences;
a determining unit, configured to determine a transition image set according to the at least two image sequences, wherein the transition image set comprises a plurality of transition images, and the transition images are images located at a predetermined position in the beginning portion or the ending portion of the image sequences;
the determining unit is further configured to determine a target image according to the transition image set;
and the processing unit is used for splicing the at least two image sequences based on the target image to obtain a target image sequence.
10. A computer-readable storage medium on which computer program instructions are stored, which, when executed by a processor, implement the method of any one of claims 1-8.
11. An electronic device comprising a memory and a processor, wherein the memory is configured to store one or more computer program instructions, and the one or more computer program instructions are executed by the processor to implement the method of any one of claims 1-8.
CN202011062278.3A 2020-09-30 2020-09-30 Video processing method and device, readable storage medium and electronic equipment Pending CN112200739A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011062278.3A CN112200739A (en) 2020-09-30 2020-09-30 Video processing method and device, readable storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112200739A true CN112200739A (en) 2021-01-08

Family

ID=74013164

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011062278.3A Pending CN112200739A (en) 2020-09-30 2020-09-30 Video processing method and device, readable storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112200739A (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190068876A1 (en) * 2017-08-29 2019-02-28 Nokia Technologies Oy Method Of Image Alignment For Stitching Using A Hybrid Strategy
CN111294644A (en) * 2018-12-07 2020-06-16 腾讯科技(深圳)有限公司 Video splicing method and device, electronic equipment and computer storage medium
CN109618222A (en) * 2018-12-27 2019-04-12 北京字节跳动网络技术有限公司 A kind of splicing video generation method, device, terminal device and storage medium
CN109688463A (en) * 2018-12-27 2019-04-26 北京字节跳动网络技术有限公司 A kind of editing video generation method, device, terminal device and storage medium
CN109740499A (en) * 2018-12-28 2019-05-10 北京旷视科技有限公司 Methods of video segmentation, video actions recognition methods, device, equipment and medium
CN109803175A (en) * 2019-03-12 2019-05-24 京东方科技集团股份有限公司 Method for processing video frequency and device, equipment, storage medium
CN110248115A (en) * 2019-06-21 2019-09-17 上海摩象网络科技有限公司 Image processing method, device and storage medium
CN110267098A (en) * 2019-06-28 2019-09-20 连尚(新昌)网络科技有限公司 A kind of method for processing video frequency and terminal
CN111640187A (en) * 2020-04-20 2020-09-08 中国科学院计算技术研究所 Video splicing method and system based on interpolation transition

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114125324A (en) * 2021-11-08 2022-03-01 北京百度网讯科技有限公司 Video splicing method and device, electronic equipment and storage medium
CN114125324B (en) * 2021-11-08 2024-02-06 北京百度网讯科技有限公司 Video stitching method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination