CN116801043B - Video synthesis method, related device and storage medium - Google Patents

Video synthesis method, related device and storage medium

Info

Publication number
CN116801043B
CN116801043B (application CN202310790573.8A)
Authority
CN
China
Prior art keywords
video
frame
slice
clip
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310790573.8A
Other languages
Chinese (zh)
Other versions
CN116801043A (en)
Inventor
Name withheld at the inventor's request
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shengshu Technology Co ltd
Original Assignee
Beijing Shengshu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Shengshu Technology Co ltd filed Critical Beijing Shengshu Technology Co ltd
Priority to CN202310790573.8A priority Critical patent/CN116801043B/en
Publication of CN116801043A publication Critical patent/CN116801043A/en
Application granted granted Critical
Publication of CN116801043B publication Critical patent/CN116801043B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/4302Content synchronisation processes, e.g. decoder synchronisation
    • H04N21/4307Synchronising the rendering of multiple content streams or additional data on devices, e.g. synchronisation of audio on a mobile phone with the video output on the TV screen
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/83Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N21/845Structuring of content, e.g. decomposing content into time segments
    • H04N21/8456Structuring of content, e.g. decomposing content into time segments by decomposing the content in the time domain, e.g. in time segments

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The embodiments of the present application relate to the field of audio and video processing and provide a video generation method, a related device, and a storage medium. The method includes: obtaining an audio clip and at least two video clips; obtaining a target clip according to the at least two video clips; and driving the target clip with the audio clip to obtain a driven target clip. The target object in each video clip corresponds to at least one specific action, the target objects in the at least two video clips correspond to different specific actions, and the target object in the target clip corresponds to at least two specific actions. A first video clip has no jumped frame in a first playing period, a second video clip has no jumped frame in a second playing period, and the end playing time of the first video clip in the first playing period is the same as the start playing time of the second video clip in the second playing period. The method and device can improve the richness and diversity of the actions of objects in the synthesized video as well as the image consistency and stability at the video splicing position.

Description

Video synthesis method, related device and storage medium
Technical Field
The present application is a divisional application of patent application with application number 2022104618544.
The embodiment of the application relates to the technical field of audio and video processing, in particular to a video synthesis method, a related device and a storage medium.
Background
In the related art, video of a subject is recorded with a shooting device. For example, to produce a teaching video, a teacher's lecture may be recorded by the shooting device. However, recording imposes high requirements on the subject, the personnel, the equipment, the venue, and so on, resulting in high recording costs. The related art may therefore adopt video synthesis technology to generate teaching videos and reduce recording costs.
In the course of research and practice of the prior art, the inventors of the present application found that, in order to enhance the authenticity of a photographed object in a video, a background video including the photographed object may be employed to generate the desired video. To reduce the shooting difficulty and shooting cost of the background video, video slices with a shorter playing duration can be shot, and the required background video is then generated by splicing the video slices. However, the motion of the object in the spliced video slices is monotonous. In addition, when the spliced video slices are played, the display effect at the splice is poor; image shake, image jump, and the like easily occur.
Disclosure of Invention
The embodiment of the application provides a video generation method, a related device and a storage medium, which can improve the action richness and diversity of objects in a synthesized video and the image consistency and stability of a video splicing part.
In a first aspect, a video generation method provided in an embodiment of the present application includes: obtaining an audio clip and at least two video clips, where the playing duration of the audio clip is at least longer than the playing duration of each of the at least two video clips; obtaining a target clip according to the at least two video clips, where the playing duration of the target clip is longer than or equal to the playing duration of the audio clip; and driving the target clip with the audio clip to obtain a driven target clip. The target object in each video clip corresponds to at least one specific action, the target objects in the at least two video clips correspond to different specific actions, and the target object in the target clip corresponds to at least two specific actions. A first video clip has no jumped frame in a first playing period and a second video clip has no jumped frame in a second playing period; the start playing time of the first playing period of the first video clip is later than the start playing time of the first video clip, the end playing time of the second playing period of the second video clip is earlier than the end playing time of the second video clip, and the end playing time of the first video clip in the first playing period is the same as the start playing time of the second video clip in the second playing period.
In one possible design, generating the complementary frame video slice Bij of the motion video slice Vi for the motion video slice Vj may include the following operations: first, obtaining the last video frame of the motion video slice Vi and the first video frame of the motion video slice Vj; then, obtaining, through a frame supplementation algorithm, a plurality of supplementary video frames between the last video frame of the motion video slice Vi and the first video frame of the motion video slice Vj, so that no jumped frame exists among the last video frame of the motion video slice Vi, the supplementary video frames, and the first video frame of the motion video slice Vj; and then combining the plurality of supplementary video frames according to a preset frame rate to obtain the complementary frame video slice Bij, where the playing duration of the complementary frame video slice Bij is a designated duration.
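A minimal sketch of this operation is given below, under stated assumptions: the design does not fix a particular frame supplementation algorithm, so simple linear blending between the two boundary frames stands in for it, and all function and parameter names are illustrative.

```python
import numpy as np

def build_complementary_slice(last_frame_vi: np.ndarray,
                              first_frame_vj: np.ndarray,
                              designated_duration_s: float = 0.5,
                              frame_rate: int = 25) -> list:
    """Return supplementary frames between Vi's last frame and Vj's first frame."""
    num_frames = max(1, round(designated_duration_s * frame_rate))
    supplementary = []
    for k in range(1, num_frames + 1):
        alpha = k / (num_frames + 1)  # interpolation weight, excluding the endpoints
        # Placeholder for a real frame supplementation algorithm (e.g., AI-based).
        frame = ((1.0 - alpha) * last_frame_vi + alpha * first_frame_vj).astype(np.uint8)
        supplementary.append(frame)
    return supplementary
```

In practice the blending line would be replaced by the chosen frame supplementation algorithm; only the bookkeeping (two boundary frames in, a designated duration of frames out at the preset frame rate) is taken from the text.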
In one possible design, determining a mapping relationship between each playback period of an audio clip and each of at least two video clips includes: receiving calibration information, wherein the calibration information comprises a corresponding relation between a playing period of an audio fragment and a video fragment; and determining the mapping relation between each playing period of the audio fragments and each of at least two video fragments based on the calibration information.
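A small sketch of how such calibration information might be consumed, assuming it arrives as (start time, end time, video slice) triples; this format is an assumption, not specified in the text.

```python
def build_period_mapping(calibration):
    """Map each playing period of the audio clip to a video slice.

    `calibration` is assumed to be a list of (start_s, end_s, slice_id) triples.
    """
    return {(start_s, end_s): slice_id for start_s, end_s, slice_id in calibration}

# Example: the first 12.5 s of audio maps to slice V1, the rest to V24.
mapping = build_period_mapping([(0.0, 12.5, "V1"), (12.5, 30.0, "V24")])
```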
In one possible design, if f2/f1 is a decimal and Q is f2/f1 rounded up, the audio frame at the end playing time of the first playing period and the audio frame at the start playing time of the second playing period overlap; if f2/f1 is a decimal and Q is f2/f1 rounded down, the audio frame at the end playing time of the first playing period and the audio frame at the start playing time of the second playing period are separated from each other.
In one possible design, obtaining a target slice from at least two video slices includes: for each of the at least two video slices, performing frame extraction on the video slice to obtain a video frame sequence; arranging the at least two video frame sequences according to a preset video slice playing order to obtain a combined video frame sequence; and combining the combined video frame sequence according to a preset frame rate to generate the target slice.
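A sketch of this assembly step using OpenCV for decoding and encoding follows; the use of OpenCV, the file-based interface, and the function names are assumptions for illustration only.

```python
import cv2

def extract_frames(path: str) -> list:
    """Frame extraction: decode one video slice into a frame sequence."""
    frames, cap = [], cv2.VideoCapture(path)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

def build_target_slice(slice_paths: list, out_path: str, preset_fps: float = 25.0) -> None:
    """Concatenate frame sequences in the preset playing order and re-encode."""
    combined = []
    for path in slice_paths:          # slice_paths already follows the preset order
        combined.extend(extract_frames(path))
    height, width = combined[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             preset_fps, (width, height))
    for frame in combined:
        writer.write(frame)
    writer.release()
```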
In one possible design, when the values of i and j in the complementary frame motion video slices Vij are the same, the video frame sequences in two adjacent complementary frame motion video slices Vij are arranged in reverse order.
In one possible design, driving the target slice with the audio slice to obtain the driven target slice may include: for the P-th video frame in the target slice, adjusting a mouth image of the P-th video frame based at least on the audio features of the (Q×P)-th to the (Q×(P+1))-th audio frames to obtain a driven P-th video frame, where P is an integer greater than or equal to 0, Q is f2/f1 rounded up or f2/f1 rounded down, f1 is the frame rate of the target slice, and f2 is the frame rate of the audio slice.
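The index bookkeeping can be illustrated as follows; whether Q is rounded up or down is left open by the design above, and the example frame rates are arbitrary assumptions.

```python
import math

def audio_frames_for_video_frame(p: int, f1: float, f2: float, round_up: bool = True) -> range:
    """Indices of the audio frames whose features drive the P-th video frame."""
    q = math.ceil(f2 / f1) if round_up else math.floor(f2 / f1)
    return range(q * p, q * (p + 1))

# Example: f1 = 25 video frames/s, f2 = 43 audio frames/s.
# Rounding up gives Q = 2, so video frame 3 is driven by audio frames 6 and 7;
# rounding down gives Q = 1, so it is driven by audio frame 3 only.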
In one possible design, the method may further include: and outputting audio and the driven target fragments to perform at least one function of teaching and guiding.
In one possible design, the video frames of the first video clip during the first playback period are different from the video frames of the second video clip during the second playback period.
In a second aspect, an embodiment of the present application provides a video generating apparatus having a function of implementing a video generating method corresponding to the first aspect. The functions may be realized by hardware, or may be realized by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above, and the modules may be software and/or hardware.
In a third aspect, embodiments of the present application provide a video processing apparatus disposed in a server and/or a client.
In one possible design, the apparatus includes a video clip obtaining module, a target clip obtaining module, and a driving module. The video clip obtaining module is configured to obtain at least two video clips; the target clip obtaining module is configured to obtain a target clip according to the at least two video clips; and the driving module is configured to drive the target clip with an audio clip to obtain a driven target clip. The target object in each video clip corresponds to at least one specific action, the target objects in the at least two video clips correspond to different specific actions, and the target object in the target clip corresponds to at least two specific actions. A first video clip has no jumped frame in a first playing period and a second video clip has no jumped frame in a second playing period; the start playing time of the first playing period of the first video clip is later than the start playing time of the first video clip, the end playing time of the second playing period of the second video clip is earlier than the end playing time of the second video clip, and the end playing time of the first video clip in the first playing period is the same as the start playing time of the second video clip in the second playing period.
A further aspect of the embodiments of the present application provides a video processing apparatus, which includes at least one connected processor, a memory and an input/output module, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program in the memory to perform the method provided in the foregoing first aspect, and the various possible designs of the first aspect.
Yet another aspect of the embodiments provides a computer readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method provided in the above-described first aspect, various possible designs of the first aspect.
Compared with the prior art, in the solution provided by the embodiments of the present application, at least two video clips are used to generate the target clip, and the two video clips can respectively include images of different types of specific actions of the target object, so that the generated target clip can include at least two specific actions corresponding to the target object, effectively improving the richness and diversity of the actions of the target object in the target clip. In addition, the absence of jumped frames in the first playing period and the second playing period helps the images of the first video clip played in the first playing period and of the second video clip played in the second playing period achieve higher consistency and stability.
Drawings
Fig. 1 is a schematic diagram of a server according to an embodiment of the present application;
fig. 2 is a schematic view of an application scenario provided in an embodiment of the present application;
FIG. 3 is a schematic flow chart of a video generating method according to an embodiment of the present application;
fig. 4 is a schematic diagram of motion video slicing, frame-complement video slicing, and frame-complement motion video slicing in an embodiment of the present application;
FIG. 5 is a schematic diagram of a process of generating a frame-complement action video slice in an embodiment of the present application;
FIG. 6 is a schematic diagram of a reference action, a specific action in an embodiment of the present application;
fig. 7 is a schematic diagram of a video frame in an action video slice according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a complementary frame in an embodiment of the present application;
fig. 9 is a schematic diagram of a generation process of a complementary frame in an embodiment of the present application;
fig. 10 is a schematic diagram of a correspondence between audio slices and video slices in an embodiment of the present application;
FIG. 11 is a schematic diagram of a method for intercepting redundant audio clips according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a process for driving video images in an embodiment of the present application;
FIG. 13 is a schematic diagram of outputting a driven target tile in an embodiment of the present application;
fig. 14 is a schematic structural diagram of an entity apparatus for performing a video generating method according to an embodiment of the present application;
Fig. 15 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
The terms "first", "second", and the like in the description, the claims, and the above figures are used to distinguish between similar objects and are not necessarily used to describe a particular sequence or chronological order. It should be understood that data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises", "comprising", and any variations thereof are intended to cover a non-exclusive inclusion, so that a process, method, system, article, or apparatus that comprises a list of steps or modules is not necessarily limited to those explicitly listed, but may include other steps or modules not expressly listed or inherent to such a process, method, article, or apparatus. The division of modules in the embodiments of the present application is only one kind of logical division; in an actual implementation, multiple modules may be combined or integrated into another system, or some features may be omitted or not implemented. The coupling, direct coupling, or communication connection between modules shown or discussed may be implemented through some interfaces, and the indirect coupling or communication connection between modules may be electrical or take other similar forms, none of which is limited in the embodiments of the present application. The modules or sub-modules described as separate components may or may not be physically separate, may or may not be physical modules, and may be distributed among a plurality of circuit modules; some or all of them may be selected according to actual needs to achieve the purposes of the embodiments of the present application.
Digital human technology requires different mouth shapes to be synchronized with different audio information in order to generate realistic digital human video. In particular, a link between the audio signal and the digital human's mouth shape needs to be established. For example, audio features (e.g., phonemes, energy, etc.) may be mapped to video features (e.g., mouth-shape features). Artificial intelligence (AI) can automatically learn the mapping between audio features and video features. For example, the mapping relationship between audio features and video features may be constructed based on machine learning techniques.
To improve the realism of the target person in the digital human video, for example to improve how faithfully the face of a lecturing teacher is restored, the digital human video can be generated using a background video that includes the target person. The length of the audio in the digital human video may be determined by the recording duration or by the length of a specific text, and may be relatively long, such as 40 minutes, 1 hour, or longer. To synthesize the digital human video, the background video must be no shorter than the audio, which requires the target person to hold a specific posture continuously while the background video is recorded. This way of recording the background video places a great physical and mental burden on the target person. In addition, the requirements on the shooting environment are high, for example the background of the video should change as little as possible, and the cost of renting a shooting venue and the like is high.
To reduce the shooting difficulty and shooting cost of the background video, video clips with shorter lengths, such as 10 seconds, 30 seconds, 1 minute, 3 minutes, or 10 minutes, can be shot, and the required background video is then generated by splicing the video clips. However, the poses of the person in different video clips may differ; in particular, the pose of the photographed subject at the end of the current video clip to be spliced may differ from the pose at the start of the next video clip, which makes video splicing inconvenient. In addition, the posture of the target person in the background video inevitably changes (for example, slight shaking), so when the spliced video clips are played, the display effect at the splice is poor; image shake, image jump, and the like easily occur.
The embodiments of the present application provide a video generation method, a related device, and a storage medium, which can be used on a server or a terminal device. By constraining the posture and the like of the target object in the video slices, and by means such as frame supplementation, the defect that the display effect at the splice does not meet the user's requirements after video slice splicing can be effectively reduced.
The scheme of the embodiment of the application can be realized based on cloud technology, artificial intelligence technology and the like, and particularly relates to the technical fields of cloud computing, cloud storage, databases and the like in the cloud technology, and the technical fields are respectively described below.
Fig. 1 is a schematic diagram of a server according to an embodiment of the present application. It should be noted that fig. 1 is only an example of a system architecture to which the embodiments of the present application may be applied to help those skilled in the art understand the technical content of the present application, and does not mean that the embodiments of the present application may not be used in other devices, systems, environments, or scenarios.
Referring to fig. 1, a system architecture 100 according to the present embodiment may include a plurality of servers 101, 102, 103. Wherein different servers 101, 102, 103 may each provide different kinds of services. For example, the server 101 may provide a text recognition service, the server 102 may provide a speech synthesis service, and the server 103 may provide an image processing service.
For example, the server 101 may transmit text recognized from an image to the server 102 to synthesize an audio clip corresponding to the text. The server 103 may perform image processing on the received video slices. For example, the server 103 may receive at least two video slices and obtain a target slice from the at least two video slices. In addition, the server 103 may generate a complementary frame video slice between two motion video slices so as to reduce the image jump at the splice of the video slices, and may perform functions such as driving the target slice with the received audio clip to obtain the driven target slice. The server 103 may also send the driven target slices, the generated mouth images, the driven video frames, and the like to the terminal device so that this information can be presented on the terminal device. For example, the terminal device may display the driven video to implement video teaching and the like. The server 103 may be, for example, a background management server, a server cluster, or a cloud server.
The cloud server can realize cloud computing (cloud computing), and cloud technology refers to a delivery and use mode of an IT infrastructure, namely that required resources are obtained in an on-demand and easily-expandable manner through a network; generalized cloud computing refers to the delivery and usage patterns of services, meaning that the required services are obtained in an on-demand, easily scalable manner over a network. Such services may be IT, software, internet related, or other services. Cloud Computing is a product of fusion of traditional computer and network technology developments such as Grid Computing (Grid Computing), distributed Computing (Distributed Computing), parallel Computing (Parallel Computing), utility Computing (Utility Computing), network storage (Network Storage Technologies), virtualization (Virtualization), load balancing (Load balancing), and the like.
For example, a cloud server may provide an artificial intelligence cloud service, also known as AI as a Service (AIaaS for short). The AIaaS platform can split several common AI services and provide independent or packaged services in the cloud. This service mode is similar to an AI theme mall: all developers can access one or more artificial intelligence services provided by the platform through an API interface, and some deep developers can also use the AI framework and AI infrastructure provided by the platform to deploy, operate, and maintain their own proprietary cloud artificial intelligence services.

Fig. 2 is a schematic view of an application scenario provided in an embodiment of the present application.
Take, as an example, video stitching in which the target object (e.g., a teacher) in both video slices is in a reference motion state (e.g., standing facing the camera and in a non-speaking state). Because the target objects (such as the teacher) in the two video clips are both in the reference motion state, the target object's actions are monotonous and the synthesized video is not lively enough. For example, a teacher who always stands facing the photographing device with no body movement does not match the image of a teacher lecturing in daily teaching tasks. It is desirable that the target object in the synthesized video can perform various actions and that those actions match the sound information of the audio clip. For example, when the teacher says "next we explain the key content … …", if the teacher in the video can show actions such as nodding or waving an arm, the video matches the audio clip better than if only the reference action is shown. However, how to splice together video slices showing different actions without producing image jumps and the like becomes a technical problem to be solved.
In addition, even if the target objects in different video clips are all in the reference action state, problems such as image shake still occur in the spliced target clip. Referring to fig. 2, only the head image of the target object is shown; fig. 2 shows the arrangement of video frames in a video clip spliced in the related art and the image of the lecturing teacher shown when the spliced video clip is played, such as the facial image of the lecturing teacher.
For example, video slice 1 (slice 1 for short) includes video frame a0 through video frame a(N-1), video slice 2 (slice 2 for short) includes video frame a0 through video frame a(N-1), and video slice 3 (slice 3 for short) includes video frame a0 through video frame a(N-1). The spliced video may include more or fewer slices, and the number of video frames in each slice may be the same or different. In the related art, slices 1, 2, and 3 may be spliced in positive order. Because the teacher can hardly keep a completely unchanged posture while the background video is recorded, the two adjacent frames at a slice splice, such as video frame a(N-1) of slice 1 and video frame a0 of slice 2 in fig. 2, are almost never identical. As a result, the image of playing video frame a(N-1) is shown as the solid-line face image in the lower diagram of fig. 2, and the image of playing video frame a0 is shown as the broken-line face image in the lower diagram of fig. 2. Because of the difference between the two, problems such as image shake occur when the video is played, which affects the playing effect.
At least part of the technical solution of the present application aims to ensure that, after slices 1, 2, and 3 are spliced, the actions of the target object connect smoothly when the video frames at the splice are played, effectively eliminating defects such as video image jitter.
The following describes an exemplary embodiment of the present application with reference to fig. 3 to 15.
Fig. 3 is a schematic flow chart of a video generating method in an embodiment of the present application. The video generation method can be executed by a server side. The video generation method may also be performed by a client. In addition, part of operations of the video generating method may be performed by the server side, and part of operations may be performed by the client side, which is not limited herein.
Referring to fig. 3, the video generating method may include operations S310 to S330.
In operation S310, an audio clip and at least two video clips are obtained, and a play duration of the audio clip is at least longer than a play duration of each of the at least two video clips.
In this embodiment, the target objects in the video clips correspond to at least one specific action, and the target objects in at least two video clips correspond to different specific actions. For example, the teacher in the video clip 1 is always in the state of the reference motion. The teacher in the video clip 2 is in a state of waving his hand. For another example, the teacher in the video clip 2 starts to be in the reference motion state, then in the waving motion state, and then in the reference motion state. For another example, the teacher in the video clip 2 starts to be in the reference motion, then in the waving motion, and then in the nodding motion. For another example, the teacher in the video clip 2 starts to be in the state of the reference motion, then in the state of the waving motion, then in the state of the nodding motion, and then in the state of the reference motion.
The terminal device can send the collected or edited video clips to the server. The actions performed by the target objects in the at least two video slices may each be the same or different. The video clips may be obtained by shooting, or may be obtained by clipping. For example, a portion of a video frame is selected from a plurality of frames of video including the target person to obtain the video clip. The audio clips may be audio collected by a sound sensor or audio obtained by speech synthesis. Because the collection difficulty of audio with longer play duration is lower than the collection difficulty of video clips with longer play duration, the play duration of an audio clip may be longer than the play duration of a video clip.
In operation S320, a target slice is obtained according to at least two video slices, where the playing time period of the target slice is longer than or equal to the playing time period of the audio slice.
In this embodiment, the target object in the target slice corresponds to at least two specific actions. That is, the target object in the target slice can perform various specific actions, which helps improve the richness and diversity of the actions of the target object in the target slice, improve the naturalness of the target object's actions in the synthesized video, and improve the consistency between the video content and common perception.
The first video clip has no jumped frame in the first playing period, and the second video clip has no jumped frame in the second playing period; the start playing time of the first playing period is later than the start playing time of the first video clip, the end playing time of the second playing period is earlier than the end playing time of the second video clip, and the end playing time of the first playing period is the same as the start playing time of the second playing period.
Specifically, the problem of jumped frames can be eliminated by constraining the starting and ending actions of the video slices, supplementing frames at the splice of the video slices, using a special video slice splicing mode, and the like.
For example, a reference action can be set, and it can be agreed that the target object is in the reference action during the starting period and the ending period of each video clip, which can effectively solve the problem that target objects in different video clips are in different specific actions and therefore cause obvious jitter at the video splice. For example, the target object in a motion video clip performs a reference action and at least one specific action, and the target object in the motion video clip performs the reference action in the first playing period and the second playing period, where the reference action may be the same as or different from the specific action.
Accordingly, obtaining at least two video slices may include: firstly, at least two motion video clips Vi are obtained from a second material library, wherein the second material library comprises n motion video clips Vi, a target object in each motion video clip Vi implements a reference motion and at least one specific motion, and for each of at least part of motion video clips Vi, the target object of the motion video clip Vi implements the reference motion in a first playing period and a second playing period, and the reference motion and the specific motion are the same or different. Then, at least two motion video slices Vi are set as the at least two video slices. Where n is an integer greater than or equal to 2 and i is an integer greater than or equal to 1. In the application scenario, the video frames of the first video clip in the first playing period and the video frames of the second video clip in the second playing period may be similar, no obvious image jump occurs, but a slight image jitter problem may still occur at the video splicing position.
For example, complementary frames may be generated for any two video slices, which allows for motion consistency of the target object at the splice of the different video slices. For example, if the image of the target object suddenly jumps from position 1 in the current video frame to position 2 in the next video frame, the larger the gap between position 1 and position 2, the more serious the image shake (see fig. 2). In this application scenario, the video frames of the first video clip in the first playing period are different from the video frames of the second video clip in the second playing period.
For example, the video frame of the first video clip during the first playback period approaches the first video frame of the second video clip frame by frame. Specifically, the first similarity is greater than the second similarity, where the first similarity is a similarity between a subsequent video frame of the first video clip in the first playing period and a first video frame of the second video clip, and the second similarity is a similarity between a preceding video frame of the first video clip in the first playing period and a first video frame of the second video clip, and the plurality of video frames of the first video clip and the second video clip are each arranged in time sequence, where the preceding video frame is arranged before the subsequent video frame. For another example, the similarity between the video frame of the second video clip at the second playback period and the last video frame of the first video clip decreases from frame to frame. Specifically, the third similarity is greater than the fourth similarity, where the third similarity is a similarity between a preceding video frame of the second video clip and a last video frame of the first video clip, and the fourth similarity is a similarity between a following video frame of the second video clip and a last video frame of the first video clip, the plurality of video frames of the first video clip and the second video clip each being arranged in time sequence, the preceding video frame being arranged before the following video frame.
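The frame-by-frame approach described here can be checked with a simple monotonicity test; the negative mean absolute pixel difference used below is a stand-in similarity measure and an assumption, not the metric used in the embodiment.

```python
import numpy as np

def similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Higher is more similar (negative mean absolute pixel difference)."""
    return -float(np.mean(np.abs(a.astype(np.float32) - b.astype(np.float32))))

def approaches_frame_by_frame(first_clip_period_frames: list, vj_first_frame: np.ndarray) -> bool:
    """True if each later frame of the first playing period is at least as similar
    to the second clip's first frame as every earlier frame (the first/second
    similarity condition described above)."""
    scores = [similarity(f, vj_first_frame) for f in first_clip_period_frames]
    return all(later >= earlier for earlier, later in zip(scores, scores[1:]))
```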
For example, image jitter in the following case can be reduced by a special video slice stitching approach: if the target objects in the two video clips perform the same action and are both in the reference action, the image jitter problem can be improved by adjusting the playing order of the video frame sequences in the video clips. In this application scenario, at least two adjacent frames at the splice between the video frames of the first video slice in the first playing period and the video frames of the second video slice in the second playing period are identical, and no image jump occurs. However, this scheme applies to fewer scenarios, for example splicing that uses only one motion video slice.
In operation S330, the target slice is driven using the audio slice, resulting in a driven target slice. For example, the mouth shape of the target object in the target slice may be adjusted according to the audio features of at least one frame in the audio slice, so that the sound and the mouth shape in the driven target slice are better matched.
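Operation S330 can be sketched as a loop over the video frames of the target slice; the mouth-shape generator is represented by a plain callable because the text does not fix a specific model, and all names here are illustrative assumptions.

```python
from typing import Callable, Sequence

def drive_target_slice(video_frames: Sequence,
                       audio_features: Sequence,
                       q: int,
                       adjust_mouth: Callable) -> list:
    """Drive each video frame with the audio features of its associated audio frames."""
    driven = []
    for p, frame in enumerate(video_frames):
        window = audio_features[q * p : q * (p + 1)]  # features driving the P-th frame
        driven.append(adjust_mouth(frame, window))
    return driven

# Usage with a no-op stand-in for the learned audio-to-mouth model:
# driven = drive_target_slice(frames, feats, q=2, adjust_mouth=lambda f, w: f)
```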
The three modes are described below in the following examples.
In some embodiments, the variety of actions a target object performs in a particular scene is typically limited. For example, when a teacher of an academic subject gives an online lesson, the actions performed may include: the reference action, a nodding action, a waving action, turning around, and so on. Such a teacher rarely runs, boxes, or the like during an online lesson. Therefore, motion video clips may be recorded separately for the various specific actions the target object may perform. Then, corresponding complementary frame video clips are generated for any two of the motion video clips. In this way, when two corresponding motion video clips are spliced based on the complementary frame video clips, problems such as image jumps do not occur.
In particular, obtaining at least two video slices may include the following operations. First, at least two complementary frame action video slices Vij are obtained from a first material library, where the first material library comprises N complementary frame action video slices Vij, and each complementary frame action video slice Vij comprises a motion video slice Vi and at least one complementary frame video slice Bij. The complementary frame video slice Bij ensures that the motion video slice Vi has no jumped frame in the first playing period and that the motion video slice Vj has no jumped frame in the second playing period. N is an integer greater than or equal to 2, and i and j are each integers greater than or equal to 1. Then, the at least two complementary frame motion video slices Vij are taken as the at least two video slices.
Fig. 4 is a schematic diagram of motion video slicing, frame-complement video slicing, and frame-complement motion video slicing in an embodiment of the present application.
Referring to fig. 4, motion video slices V1-Vn may be pre-acquired, wherein each video slice may include one or more specific motion. Since the action video slicing is a shot video for the target object, the actions of the target object are coherent, and no jump frame occurs.
Then, for any two of the motion video slices V1 to Vn, a frame-complementary video slice for between the two motion video slices is generated. See the complementary frame video slices B11-Bnn in fig. 4. Advantages of employing pre-generated complementary frame video slices herein may include: for example, the pre-generated complementary frame video clips can be directly called, the video clips are not required to be generated when the video clips are used, and the response speed is high. For example, the number of motion video slices is limited, and the number is not excessive, and the generated complementary frame video slices are also limited, and do not occupy excessive storage space.
The complementary frame video slices and the motion video slices can then be stored directly as a whole, which reduces the risk of splicing errors. See the complementary frame motion video slices V11-Vnn in fig. 4. For example, the complementary frame motion video slice V11 is composed of the motion video slice V1 and the complementary frame video slice B11.
Note that the complementary frame video slice B21 and the complementary frame video slice B12 are two different complementary frame video slices. This is because the motion video clip V1 includes a start play period and an end play period, and the motion video clip V2 also includes a start play period and an end play period. The video frames of the start playing period and the end playing period of the motion video clip V1 are not identical, such as a slight gesture change of the target object. Referring to fig. 2, if the complementary frame video slice B21 is used instead of the complementary frame video slice B12, image hopping as shown in fig. 2 may occur.
That is, the complementary frame video clip B21 interfaces the end play period of the motion video clip V2 with the start play period of the motion video clip V1. The complementary frame video clip B12 interfaces the end play period of the motion video clip V1 with the start play period of the motion video clip V2.
Further, the generation of the complementary frame video slice B11 is for the following reason: the motion video clip V1 includes a start play time period and a stop play time period, and if the motion video clip V1 and the motion video clip V1 are used to splice in order to achieve the effect of increasing the play time period of the target clip, etc., the video frames of the start play time period and the stop play time period of the motion video clip V1 are not identical, such as the target object has a slight gesture change. Referring to fig. 2, if the motion video clip V1 and the motion video clip V1 are directly spliced, image jumping as shown in fig. 2 may occur.
The following describes an exemplary procedure for generating the complementary frame action video slice Vij.
In some embodiments, the above method may further comprise: and constructing and/or updating the first material library.
Specifically, the first material library may be constructed in the following manner.
First, n motion video slices Vi are obtained, where i ≤ n and N = n × n. See fig. 4 for the motion video slices.
Then, for each of the n motion video slices Vi, a complementary frame motion video slice Vij of the motion video slice Vi for the motion video slice Vj is generated, where i and j may be the same or different. See fig. 4 for the complementary frame action video slices.
In addition, the video clip Vij of the frame-filling action can be stored to construct the first material library or update the first material library.
When the first material library is used for video synthesis, a target fragment can be obtained through a video fragment splicing mode, and the method is specifically shown as follows. For example, in order to splice the motion video slice V2 and the motion video slice V4, the frame-complementary motion video slice V24 and the motion video slice V4 may be selected. For example, in order to splice the motion video slice V2, the motion video slice V4, and the motion video slice V1, the frame-complement motion video slice V24, the frame-complement motion video slice V41, and the motion video slice V1 may be selected. The above description is given by way of example in which the complementary frame video clip is added to the motion video clip. In addition, other adding modes can be adopted to generate the frame-supplementing action video clips.
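The lookup described in this paragraph can be sketched as a simple planning helper; the naming scheme (Vij for a complementary frame action slice, Vi for a plain one) follows the text, while the function itself is an illustrative assumption.

```python
def plan_splice(action_sequence: list) -> list:
    """Slice names to concatenate so the requested actions play in order."""
    plan = [f"V{i}{j}" for i, j in zip(action_sequence, action_sequence[1:])]
    plan.append(f"V{action_sequence[-1]}")  # last action is played as a plain slice
    return plan

print(plan_splice([2, 4]))     # ['V24', 'V4']
print(plan_splice([2, 4, 1]))  # ['V24', 'V41', 'V1']
```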
In some embodiments, generating the motion video slice Vi for the complementary frame motion video slice Vij of the motion video slice Vj may include the following operations. First, a complementary frame video slice Bij of the motion video slice Vi for the motion video slice Vj is generated. Then, the frame-supplementing action video clips Vij including the frame-supplementing video clips Bij are obtained by splicing and the like.
The above-described splicing manner may include various manners. Fig. 5 is a schematic diagram of a process of generating a frame-complement action video slice in an embodiment of the present application. Referring to fig. 5, a complementary frame action video clip may be generated using a variety of stitching approaches.
For example, the complementary frame video slice Bij is set after the last video frame of the motion video slice Vi, and the complementary frame motion video slice Vij is obtained. Referring to fig. 5, taking the case that the motion video slice V1 and the motion video slice V2 need to be spliced as an example, after the complementary frame video slice B12 is generated, the complementary frame video slice B12 may be spliced at the rear end of the motion video slice V1 to obtain the complementary frame motion video slice V12, and then the splicing effect of no jump frame between the video slices may be achieved by splicing the complementary frame motion video slice V12 and the motion video slice V2.
For example, the complementary frame video slice Bij is set before the first video frame of the motion video slice Vj, and the complementary frame motion video slice Vij is obtained. Referring to fig. 5, taking the case that the motion video slice V1 and the motion video slice V2 need to be spliced as an example, after the complementary frame video slice B12 is generated, the complementary frame video slice B12 may be spliced at the front end of the motion video slice V2 to obtain the complementary frame motion video slice V12, and then the splicing effect without a jump frame between the video slices may be achieved by the mode that the motion video slice V1 and the complementary frame motion video slice V12 are spliced.
In summary, there may be multiple splicing modes between the motion video slices and the complementary frame video slices, which are not limited herein. For another example, a first portion of the complementary frame video slice Bij is disposed after the last video frame of the motion video slice Vi, and a second portion of the complementary frame video slice Bij is disposed before the first video frame of the motion video slice Vj, to obtain the complementary frame motion video slice Vij. That is, one complementary frame video slice Bij may be split into two halves, which are spliced after the motion video slice Vi and before the motion video slice Vj, respectively. With this processing between the motion video slices and the complementary frame video slices, the complementary frame action video slice Vij can be spliced with any required complementary frame action video slice Vjx, where 1 ≤ x ≤ n, and no jumped frame exists at the splice.
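The three placements discussed above can be sketched as follows, with slices modeled as frame lists; the halfway split point in the third mode is an assumption, since the text only says the slice is split into two portions.

```python
def place_bij(vi: list, bij: list, vj: list, mode: str = "after_vi"):
    """Return the two slices to splice, with Bij placed according to `mode`."""
    if mode == "after_vi":   # Vij = Vi + Bij, then splice with Vj
        return vi + bij, vj
    if mode == "before_vj":  # Vij = Bij + Vj, spliced after Vi
        return vi, bij + vj
    if mode == "split":      # first portion after Vi, second portion before Vj
        mid = len(bij) // 2
        return vi + bij[:mid], bij[mid:] + vj
    raise ValueError(f"unknown mode: {mode}")
```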
The following is an exemplary description of a procedure and principle of generating the motion video slice Vi for the complementary frame video slice Bij of the motion video slice Vj.
For ease of understanding, reference actions, specific actions, and the like are first illustrated. Fig. 6 is a schematic diagram of a reference action and a specific action in the embodiment of the present application.
Referring to fig. 6, three actions are shown. The left frame is the reference action: standing facing the photographing device and in a silent state. The middle frame is a first specific action: standing sideways and in a silent state. The right frame is a second specific action: standing facing away from the photographing device and in a silent state. Fig. 6 shows only three actions; more specific actions may be included, such as a tilting action, a lowering action, a lifting action, a nodding action, a turning action, and the like, which are not shown one by one.
Fig. 7 is a schematic diagram of a video frame in an action video slice according to an embodiment of the present application. Referring to fig. 7, in a video clip including a sideways standing action, a video frame including a target subject standing sideways is included, as shown in the middle frame in fig. 7. If the video clips including the sideways standing motion and the video clips including the reference motion are spliced, it is necessary to generate a complementary frame video clip for example between the left video frame and the middle video frame, and it is difficult to obtain a continuous and natural complementary frame video clip of motion by a complementary frame method because the motion difference between the two is too large. Thus, a convention may be made for recording video clips that include a sideways standing action: if the target object in the second playing period is in the reference motion, the target object in the first playing period is also in the reference motion, and in the playing period between the second playing period and the first playing period, the target object can be naturally switched from the reference motion to the sideways standing motion and then switched to the reference motion. Therefore, the problem that the motion phase difference of the target object in the two motion video clips to be spliced is overlarge can be effectively solved. In addition, in the recorded video clips including the sideways standing action, the action of the target object is continuously changed, and no frame is jumped inside the video clips. Thus, the frame which is not jumped between the two video clips to be spliced is convenient to be obtained in a frame supplementing mode.
Fig. 8 is a schematic diagram of a complementary frame in an embodiment of the present application.
Referring to fig. 8, video frame i may be the last video frame in one video slice; it can be seen that the head of the target object is turned slightly to the left, and the head of the target object is located in the upper left corner of video frame i. Video frame i+4 in fig. 8 may be the first video frame in the video slice to be spliced; it can be seen that the head of the target object is not turned, and the head of the target object is located in the lower right corner of video frame i+4. To avoid an image jump at the splice between this video slice and the video slice to be spliced, video frames i+1 to i+3 may be generated to improve the image transition between video frame i and video frame i+4. For example, the image of the target object in video frames i+1 to i+3 may gradually transition from the action of the target object in video frame i toward that in video frame i+4. The auxiliary lines in fig. 8 help show the trajectory of the head image of the target object.
In some embodiments, a complementary frame video slice may be generated by a complementary frame algorithm or the like when there is no excessive difference between the actions of the target objects of the two video frames.
Specifically, generating the complementary frame video slice Bij of the motion video slice Vi for the motion video slice Vj may include the following operations.
First, the last video frame of the motion video slice Vi and the first video frame of the motion video slice Vj are obtained. See left video frame and right video frame in fig. 7.
Then, a plurality of complementary video frames between the last video frame of the motion video slice Vi and the first video frame of the motion video slice Vj are obtained through a frame supplementation algorithm, so that no jumped frame exists among the last video frame of the motion video slice Vi, the complementary video frames, and the first video frame of the motion video slice Vj. Specifically, various frame supplementation algorithms may be used, such as an artificial-intelligence-based frame supplementation algorithm or a frame supplementation algorithm based on preset rules, which are not limited herein. The number of complementary video frames may be determined based on fluency, a preset playing duration of the complementary frame video clip, and the like.
And then, combining a plurality of complementary video frames according to a preset frame rate to obtain complementary frame video clips Bij, wherein the playing time length of the complementary frame video clips Bij is a designated time length. For example, the playback duration of the complementary frame video clips may be preset to be 0.3 seconds, 0.5 seconds, 0.8 seconds, 1 second, etc., which is not limited herein.
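As a worked example (the 25 fps preset frame rate, the choice of 0.5 seconds from the listed durations, and the rounding direction are all assumptions):

$$N_{B_{ij}} \approx T_{B_{ij}} \times f = 0.5\,\mathrm{s} \times 25\,\mathrm{fps} = 12.5 \;\Rightarrow\; 12 \text{ or } 13 \text{ complementary frames, depending on rounding.}$$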
Fig. 9 is a schematic diagram of a complementary frame generation process in an embodiment of the present application. Fig. 9 illustrates the generation process by taking as an example a change in the image position of the target object between video frames.
First, a plurality of feature points of a target object in a video frame may be determined, and then a complementary frame video clip is generated based on a positional change relationship between the plurality of feature points in two frames of video frames, the number of frames to be inserted, and the like.
Specifically, first, corresponding feature points in two video frames may be identified, such as calibrating a first position of a pupil in a last video frame in a current video clip, and calibrating a second position of a pupil in a first video frame in the video clip to be spliced.
Then, based on the position changes between the corresponding feature points in the two video frames, the motion vectors from the first frame to the second frame, such as the translation in the x direction, the translation in the y direction, and the rotation angle, are determined.
Then, the motion vectors can be processed based on the number of complementary frames preset by the user to determine the updated position of each feature point in each complementary frame; the positions of the remaining pixels of the target object in the video frame can then be updated based on the updated positions of the feature points, generating complementary frames 1 to n.
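A minimal sketch of the feature-point interpolation described above is given below; it assumes each feature point is a 2-D coordinate, interpolates only the x/y translation (rotation is omitted for brevity), and the identifiers are illustrative.

```python
import numpy as np

def interpolate_feature_points(points_start, points_end, num_frames):
    """Linearly interpolate feature point positions across the inserted frames.

    points_start / points_end: arrays of shape (num_points, 2) holding the
    calibrated positions (e.g. pupil, mouth corners) in the last frame of the
    current slice and the first frame of the slice to be spliced.
    Returns a list of (num_points, 2) arrays, one per complementary frame.
    """
    motion = points_end - points_start            # per-point motion vectors (dx, dy)
    positions = []
    for k in range(1, num_frames + 1):
        step = k / (num_frames + 1)               # fraction of the motion applied
        positions.append(points_start + step * motion)
    return positions

# Example: one feature point moving from the upper-left to the lower-right.
start = np.array([[20.0, 15.0]])
end = np.array([[180.0, 140.0]])
for frame_points in interpolate_feature_points(start, end, num_frames=3):
    print(frame_points)
```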
It should be noted that the above frame interpolation algorithm is only exemplary, and is not to be construed as limiting the technical solution of the present application, and a plurality of frame interpolation algorithms may be adopted, for example, frame interpolation algorithm based on artificial intelligence may be adopted to perform frame interpolation, which is not limited herein.
In a particular embodiment, the first material library may be generated based on the second material library. Take the second material library as the motion video material library {V1, V2, …, Vn}, where n is the total number of materials. The short material videos in the motion material library are videos of different motions of the same person in the same scene, and V1 is the reference motion video slice; n×n complementary frame motion videos can be generated by the following process. Since there are n motion video slices, the motion states at the head and tail of each slice may not be in the same state and position. Therefore, frame interpolation is performed on the n video slices pairwise, generating n×n complementary frame motion video slices. The frame interpolation process is defined as f: Vi×Vj→Vij, i ∈ [1, n], j ∈ [1, n], where Vij is the generated smooth complementary frame motion video slice. Take two motion video slices Vi and Vj from the motion video material library, where Vi is the preceding video to be spliced and Vj is the following one, and take the end frame Fi of Vi and the start frame Fj of Vj. Images between Fi and Fj are supplemented using a frame interpolation algorithm to make the motion continuous and smooth, and the supplemented image frames are spliced with Vi to obtain the complementary frame motion video Vij.
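The pairwise construction f: Vi×Vj→Vij can be outlined as follows; this is a sketch under the assumption that a frame interpolation helper such as the one shown earlier is available, and the function names are illustrative.

```python
def build_first_material_library(motion_slices, make_complementary):
    """Build n x n complementary frame motion slices Vij = f(Vi, Vj).

    motion_slices: list of n motion video slices (each a list of frames).
    make_complementary: callable (Vi, Vj) -> list of supplementary frames,
        e.g. the build_complementary_slice sketch shown earlier.
    Returns a dict keyed by (i, j), 1-based, holding Vi followed by the
    interpolated frames, i.e. the complementary frame motion slice Vij.
    """
    library = {}
    n = len(motion_slices)
    for i in range(n):
        for j in range(n):
            vi, vj = motion_slices[i], motion_slices[j]
            library[(i + 1, j + 1)] = list(vi) + make_complementary(vi, vj)
    return library
```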
In the embodiments of the present application, by means such as frame interpolation and constraining the motion of the target object in the video slices, the spliced target slice can include multiple specific motions of the target object, the motions of the target object in the target slice are smoother and more natural, and problems such as image jumps are avoided.
In some embodiments, the spliced target fragments can be used as background videos, and then the audio fragments are utilized to drive actions, expressions, mouth shapes and the like of target objects in the target fragments, so that natural and smooth digital human videos are generated.
In particular, the above method may further comprise the following operations. An audio clip is obtained, the audio clip including a plurality of playback time periods.
Accordingly, obtaining the target slice from the at least two video slices may include: first, a mapping relationship between each playing period of an audio clip and each of at least two video clips is determined. And then, splicing at least two video clips according to the playing time period of the audio clip based on the mapping relation to obtain the target clip.
The mapping relationship may be determined, for example, by way of calibration. Specifically, determining the mapping relationship between each playback period of the audio clip and each of the at least two video clips may include the following operations.
Firstly, calibration information is received, wherein the calibration information comprises the corresponding relation between the playing time interval of the audio fragment and the video fragment. For example, the user designates a certain playing period in the audio clips for playing a certain action video clip or a certain supplementary action video clip in a calibrated manner. The calibration information may be a time stamp and/or a video slice flag, etc.
And then, determining the mapping relation between each playing period of the audio fragments and each of at least two video fragments based on the calibration information.
Fig. 10 is a schematic diagram of a correspondence between audio slices and video slices in an embodiment of the present application.
Referring to fig. 10, by listening to the audio clip or reading the text information corresponding to it, the user can determine the motion required in each playing period T0, T1, T2, T3, etc., that is, the motion video clips or complementary frame motion video clips V01, V12, V21, V14, etc. In this way, the complementary frame motion video clips V01, V12, V21, V14 corresponding to the playing periods T0, T1, T2, and T3 in the audio clip can be determined.
In one particular embodiment, a mapping table from time periods to complementary frame motion videos may be constructed. The audio clip is annotated segment by segment in time, and the annotation for each time period is the complementary frame motion video corresponding to that period. The mapping table is shown in Table 1.
Table 1: mapping table
Time period Frame-supplementing action video
T0-T1 V01
T1-T2 V12
T2-T3 V21
T3-T4 V14
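Table 1 can be represented programmatically as an ordered mapping from playing periods to clip identifiers; the (start, end) timestamps below are hypothetical and only the structure mirrors the table.

```python
# Mapping table: playing period of the audio clip -> complementary frame motion video.
# The (start, end) timestamps are hypothetical; only the structure mirrors Table 1.
period_to_clip = [
    ((0.0, 4.0), "V01"),    # T0-T1
    ((4.0, 9.0), "V12"),    # T1-T2
    ((9.0, 13.0), "V21"),   # T2-T3
    ((13.0, 18.0), "V14"),  # T3-T4
]

def clip_for_time(t, table=period_to_clip):
    """Return the clip annotated for playback time t (seconds)."""
    for (start, end), clip_id in table:
        if start <= t < end:
            return clip_id
    return None

print(clip_for_time(10.5))   # -> "V21"
```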
In some embodiments, the above-described mapping relationship may also be determined with the aid of artificial intelligence.
For example, determining the mapping relationship between each playback period of an audio clip and each of at least two video clips may include the following operations.
Firstly, analyzing audio fragments to obtain sound characteristics; and/or analyzing text information corresponding to the audio fragments to obtain semantic features.
Sound features include, but are not limited to: speech features, pitch features, and/or volume features. For example, when emphasizing certain content, a teacher may raise the volume or wave a hand, nod, or the like to attract the audience's attention; accordingly, video clips of waving or nodding can be used when the volume in the audio clip exceeds a set volume threshold.
The semantic features may characterize the action to be performed by the user. For example, the text information corresponding to an audio segment may be obtained first, and the semantic features of the text then obtained through semantic understanding, which facilitates determining the specific action corresponding to the audio segment from the semantic features. For example, the text information includes: "Please read page 99 of the book." The corresponding specific action may be a head-lowering action (a book is typically held lower than the head). The text information of the audio clip may also be obtained by means of speech recognition.
Then, a mapping relationship between the sound features and/or semantic features and each of the at least two video clips is determined. For example, sound features may have sound feature identifications, semantic features may have semantic feature identifications, and video clips may have video clip identifications. The mapping relationship may thus be formed by storing the sound feature identification and the video clip identification in association, or by storing the semantic feature identification and the video clip identification in association.
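A minimal sketch of such an association store is shown below; the feature identifiers, clip identifiers, and the fallback to a reference-motion clip are illustrative assumptions, not values from the patent.

```python
# Hypothetical associations between feature identifiers and clip identifiers.
sound_feature_to_clip = {
    "volume_above_threshold": "V_wave_hand",   # emphasis -> waving clip
    "pitch_rising": "V_nod",
}
semantic_feature_to_clip = {
    "read_book": "V_lower_head",               # "please read page 99" -> head-lowering clip
}

def select_clip(sound_feature_id=None, semantic_feature_id=None):
    """Resolve a video clip identifier from a sound and/or semantic feature id."""
    if sound_feature_id in sound_feature_to_clip:
        return sound_feature_to_clip[sound_feature_id]
    if semantic_feature_id in semantic_feature_to_clip:
        return semantic_feature_to_clip[semantic_feature_id]
    return "V_reference"                        # fall back to the reference motion clip

print(select_clip(semantic_feature_id="read_book"))   # -> "V_lower_head"
```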
In some embodiments, after the mapping relationship is determined, at least two video clips may be spliced according to the playing period of the audio clip based on the mapping relationship, so as to obtain the target clip. Referring to fig. 10, according to the mapping table, the frame-complement motion video is used for stitching, so as to obtain the target slice.
The target slice can be obtained in the above manner, but it is only a background video. To make the mouth shape of the target object in the background video consistent with the speech content of the audio clip, the target slice is driven by the audio clip, which improves the naturalness of the synthesized video.
The following describes an exemplary procedure for driving the target slice with the audio slice to obtain the driven target slice. The audio slice may be an audio clip of the target person captured by a microphone or the like, and it may be edited, noise-reduced, and so on. The audio slice may also be obtained by speech synthesis: for example, the server inputs the target text information by calling an interface provided by a speech synthesis platform, and the platform synthesizes the audio clip corresponding to the target text information.
Specifically, one or more audio frames in the audio slice may be used to drive one video frame in the target slice. For example, if an audio frame indicates that the user is silent, the mouth in the face image of the target person in the corresponding video frame is closed; if an audio frame indicates that the user is speaking, the mouth is open; if an audio frame corresponds to the pronunciation of "good", the mouth in the corresponding video frame takes the same shape as when pronouncing "good".
It should be noted that, for motion video slices containing only the reference motion, the computing resources required for splicing can be reduced by the following special splicing manner. For example, obtaining the target slice from the at least two video slices may include the following operations.
First, for each video slice of at least two video slices, the video slice is decimated to obtain a sequence of video frames. For example, all frames may be extracted from a certain video slice, and the frames may be arranged in the shooting order (positive order) or reverse order, resulting in a video frame sequence.
Then, the video frames in the at least two video frame sequences are ordered according to a preset video slice splicing mode to obtain a combined video frame sequence, where the splicing mode includes splicing adjacent video slices in mutually reverse order; that is, the video frame sequences of adjacent video slices are arranged in reverse order with respect to each other.
And then, combining the combined video frame sequences according to a preset frame rate to generate the target fragments.
Specifically, the video slice splicing mode is: forward-reverse-forward-…, or reverse-forward-reverse-…. There is no special requirement on the arrangement order of the video frames within a video slice; for example, video slices 1 and 2 may each be spliced in forward or reverse order, as long as the playback effect is not affected. Furthermore, the lengths of the video slices may be the same or different. If the second video slice is shorter than the first video slice, it is only necessary to ensure that the end video frame of the second video slice is identical to the start video frame of the first video slice, or that the end video frame of the first video slice is identical to the start video frame of the second video slice. For example, the video frames of slice 1 in fig. 2 may be arranged in forward order, those of slice 2 in reverse order, and those of slice 3 in forward order.
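A minimal sketch of the alternating forward/reverse splice is shown below, assuming each video slice is simply a list of frames; it is an illustration of the idea rather than the patent's implementation.

```python
def splice_alternating(slices):
    """Concatenate video slices, reversing every other slice.

    Adjacent slices are arranged in mutually reverse order (forward, reverse,
    forward, ...), so that when two adjacent slices share an end frame, the
    frames at the join are identical and no jump occurs.
    """
    combined = []
    for index, frames in enumerate(slices):
        ordered = frames if index % 2 == 0 else list(reversed(frames))
        combined.extend(ordered)
    return combined

# Example: slices 1 and 2 end on the same frame "E"; reversing slice 2 makes the
# join E|E seamless. Slice 3 starts where reversed slice 2 ends ("b0").
s1, s2, s3 = ["a0", "a1", "E"], ["b0", "b1", "E"], ["b0", "c1", "c2"]
print(splice_alternating([s1, s2, s3]))
# -> ['a0', 'a1', 'E', 'E', 'b1', 'b0', 'b0', 'c1', 'c2']
```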
The principle of the audio slice driving target slice is exemplarily described below.
To facilitate understanding of the technical solutions of the present application, the correspondence between audio frames and video frames, the length of an audio frame, and the like are described here by way of example.
For example, the playback duration of one audio frame may be set to the reciprocal of the image frame rate. If the image frame rate is 50 fps, 50 frames are transmitted per second and each video frame occupies 20 ms of playback time, so 20 ms of audio may correspond to one video frame. Accordingly, setting the preset duration to the reciprocal of the frame rate aligns the segmented audio output with the picture in time.
However, in some scenarios, the frame rate of the audio frames in the audio slices and the frame rate of the video frames in the video slices are different.
For example, the frequency range of normal human hearing is approximately between 20Hz and 20 kHz. The sampling frequency (sampling) refers to the number of samples of the acoustic wave amplitude taken per second when the analog acoustic waveform is digitized. For example, to reduce the distortion rate of sound, the sampling frequency may be greater than 16kHz. Typical audio sampling frequencies are 8kHz, 11.025kHz, 16kHz, 22.05kHz, 37.8kHz, 44.1kHz, 48kHz, etc. For example, a frame of audio frames may be formed at 200 sample points.
A sampling rate of 16 kHz means 16000 sampling points per second, and the playback duration of one Advanced Audio Coding (AAC) audio frame = number of sampling points per frame / sampling frequency. For an audio frame rate of 80 fps with 200 sampling points per frame, the playback duration of one audio frame = 200 × 1000 / 16000 = 12.5 milliseconds (ms). A video frame rate of about 25 fps is sufficient for the video playback effect, i.e., 25 pictures are transmitted per second, so each picture occupies 1000/25 = 40 ms. The playback durations of the two therefore differ.
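The arithmetic above can be reproduced directly; the snippet below merely restates the example values (16 kHz sampling, 200 samples per audio frame, 25 fps video).

```python
SAMPLE_RATE_HZ = 16_000        # 16 kHz sampling
SAMPLES_PER_AUDIO_FRAME = 200  # samples per AAC-style audio frame
VIDEO_FPS = 25                 # video frame rate

audio_frame_ms = SAMPLES_PER_AUDIO_FRAME * 1000 / SAMPLE_RATE_HZ   # 12.5 ms
audio_fps = SAMPLE_RATE_HZ / SAMPLES_PER_AUDIO_FRAME               # 80 fps
video_frame_ms = 1000 / VIDEO_FPS                                  # 40 ms

print(audio_frame_ms, audio_fps, video_frame_ms)   # 12.5 80.0 40.0
```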
In order to facilitate the generation of digital person information including audio and video of equal play time length, the correspondence between video frames and audio frames may be determined as follows.
In some embodiments, each of the at least two video slices has a frame rate of a first frame rate f1 and the audio slices has a frame rate of a second frame rate f2, the second frame rate f2 being greater than the first frame rate f1.
Accordingly, one video frame of the video slice corresponds to N audio frames of the audio slice, where N = f2/f1 rounded up, or N = f2/f1 rounded down.
In some embodiments, before driving the target slice with the audio slice, the method may further include: if f2/f1 is a non-integer greater than 1 and the number of audio frames is rounded up, it is determined that there is an overlap between the audio frame at the end playback time of the first playback period and the audio frame at the start playback time of the second playback period.
Accordingly, for the P-th video frame in the target slice, driving the target slice with the audio slice to obtain the driven target slice may include the following operations.
Based at least on the audio features of the (Q×P)-th audio frame through the (Q×(P+1)-1)-th audio frame, the mouth image of the P-th video frame in the target slice is adjusted to obtain the driven P-th video frame.
Specifically, a first correspondence is determined first, the first correspondence including: the (Q×P)-th audio frame through the (Q×(P+1)-1)-th audio frame of the audio slice correspond to the P-th video frame of the target slice, where the overlapping portion of the (Q×(P+1)-1)-th audio frame also corresponds to the (P+1)-th video frame of the target slice.
Then, driving the video frame corresponding to the audio frame by utilizing the audio frame based on the first corresponding relation to obtain the driven target video frame.
The case in which the overlapping portion of the (Q×(P+1)-1)-th audio frame also corresponds to the (P+1)-th video frame of the target slice is illustrated below.
A video frame (e.g., aP) may correspond to a plurality of audio frames (e.g., b0 to b(Q-1)), where P denotes a sequence number, e.g., P may be 0, 1, 2, 3, …. The multiple relationship between audio frames and video frames may be denoted as Q. The audio frames used to drive the 0th video frame a0 may include the 0th audio frame b0 through the (Q-1)-th audio frame b(Q-1). The audio frames used to drive the 1st video frame a1 may include the Q-th audio frame b(Q) through the (2×Q-1)-th audio frame b(2×Q-1). Because the number of audio frames is rounded up, the (Q-1)-th audio frame b(Q-1) and part of the sampling points in the Q-th audio frame b(Q) are both used to drive the 1st video frame a1, so a certain overlap exists between the two audio frames b(Q-1) and b(Q). Taking f2 = 80 fps and f1 = 25 fps as an example, Q = 80/25 rounded up = 4, the overlap ratio is 4 - 3.2 = 0.8, and the overlap duration is 1000/80 × 0.8 = 10 ms; such an overlap is imperceptible to the human ear and does not affect the playback effect.
Alternatively, the number of audio frames may be rounded down. However, this would result in the (Q-1)-th audio frame b(Q-1) covering only part of the playback time of the 0th video frame a0 and not reaching its end, leaving a certain time gap between the two audio frames b(Q-1) and b(Q). Taking f2 = 80 fps and f1 = 25 fps as an example, Q = 80/25 rounded down = 3, the gap ratio is 3.2 - 3 = 0.2, and the gap duration is 1000/80 × 0.2 = 2.5 ms; such a gap is imperceptible to the human ear and does not affect the playback effect.
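The rounding choice and the resulting overlap or gap can be computed as follows; the function simply restates the f2 = 80 fps, f1 = 25 fps example.

```python
import math

def audio_video_alignment(f2_audio_fps, f1_video_fps):
    """Return (Q_up, overlap_ms, Q_down, gap_ms) for the given frame rates."""
    ratio = f2_audio_fps / f1_video_fps           # audio frames per video frame, e.g. 3.2
    q_up = math.ceil(ratio)                       # 4 -> adjacent groups overlap
    q_down = math.floor(ratio)                    # 3 -> adjacent groups leave a gap
    audio_frame_ms = 1000 / f2_audio_fps          # 12.5 ms per audio frame
    overlap_ms = (q_up - ratio) * audio_frame_ms  # 0.8 * 12.5 = 10 ms
    gap_ms = (ratio - q_down) * audio_frame_ms    # 0.2 * 12.5 = 2.5 ms
    return q_up, overlap_ms, q_down, gap_ms

print(audio_video_alignment(80, 25))   # -> (4, ~10.0 ms overlap, 3, ~2.5 ms gap)
```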
Through the embodiment, the corresponding relation between the video frames in the target fragments and the audio frames in the audio fragments can be established, and the generation of the digital human video is facilitated.
In some embodiments, the playback duration of the spliced video slices and that of the audio slice may not be consistent; they can be kept consistent through a cropping operation or the like. Specifically, obtaining the target slice from the at least two video slices may include: if the total playback duration of the at least two video slices is longer than the playback duration of the audio slice, cropping the at least two video slices based on the playback duration of the audio slice to obtain the target slice, so that the playback duration of the target slice is consistent with that of the audio slice.
Fig. 11 is a schematic diagram of cutting out redundant audio clips according to an embodiment of the present application.
Referring to fig. 11, after a video clip is obtained by a method such as splicing, the playing time period of the video clip may be longer than the playing time period of an audio clip. This may be due to the fact that the playing time of the audio clip and the playing time of the video clip are not integer multiples of each other.
To solve this problem, operations such as cropping may be performed on the video clips, such as removing redundant video frames, so that the playback time length of the audio clip and the playback time length of the target clip remain identical.
For example, obtaining the target slice from the at least two video slices may include: if the total playback duration of the at least two video slices is longer than the playback duration of the audio slice, cropping the at least two video slices based on the playback duration of the audio slice to obtain the target slice, whose playback duration is consistent with that of the audio slice.
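A minimal sketch of such a trimming step is shown below, assuming the spliced video is a list of frames at a known frame rate; the identifiers are illustrative.

```python
def trim_to_audio(video_frames, video_fps, audio_duration_s):
    """Drop trailing video frames so the video is no longer than the audio."""
    max_frames = int(audio_duration_s * video_fps)   # whole video frames that fit
    return video_frames[:max_frames]

# Example: 3.0 s of spliced video at 25 fps, but only 2.75 s of audio.
frames = list(range(75))
print(len(trim_to_audio(frames, 25, 2.75)))   # -> 68 frames (2.72 s of video kept)
```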
The following is an exemplary description of a process of driving video slices by audio slices.
In some embodiments, driving the target slice with the audio slice may include: driving at least some of the video frames in the target slice one by one using at least some of the audio frames in the audio slice, where multiple audio frames may be used to drive one video frame. For example, for each audio frame in the audio slice, the video frame corresponding to that audio frame is driven based on its audio features.
There is a correspondence between the audio features of the multi-frame audio frames and the mouth shape features of the target person so as to generate a mouth image of the target person based on the multi-frame audio frames.
Fig. 12 is a schematic diagram of a process of driving a video image according to an embodiment of the present application.
Referring to fig. 12, the server performs feature extraction on the obtained audio sub-segments and on the mouth images of the video frames, respectively, to obtain audio features and video features (e.g., mouth features). The fused features (e.g., the concatenated audio and mouth features) are then input to a decoder for decoding to obtain the driven mouth image corresponding to the audio frames in the audio sub-segment. The mouth image in the corresponding background image of the video frame can then be replaced with this mouth image, yielding the driven video frame image.
Wherein the mouth features may be features automatically extracted, such as by a neural network. The mouth feature may also be a feature extracted based on a preset rule, for example, a plurality of feature points respectively represent a mouth angular position, an upper lip middle position, a lower lip middle position, and the like, so that the shape of the mouth is conveniently represented based on the positions of the plurality of points. The mouth feature may be a combination of a feature automatically extracted by a neural network and a feature extracted based on a preset rule, which is not limited herein.
The audio features may likewise be features automatically extracted, for example by a neural network. They may also be features extracted based on preset rules, such as at least one of Mel-frequency cepstral coefficients (MFCC), zero-crossing rate, short-time energy, short-time autocorrelation function, spectrogram, short-time power spectral density, short-time average amplitude difference, spectral entropy, fundamental frequency, formants, and the like. The audio feature may also be a combination of features automatically extracted by a neural network and features extracted based on preset rules, which is not limited herein.
Taking Mel-frequency cepstral coefficients as an example: the Mel scale is derived from the auditory characteristics of the human ear and has a non-linear correspondence with frequency in Hz. MFCCs are spectral features computed using this relationship, and are mainly used to extract features from speech data while reducing the computational dimensionality. For example, from a frame of 512 sampling points, around 40 of the most important dimensions are typically retained. Other audio features are not listed here.
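As an example of extracting a rule-based audio feature, MFCCs can be computed with a standard audio library; the snippet assumes librosa is installed, that the audio slice is available as a mono WAV file at the hypothetical path shown, and that 40 coefficients are kept.

```python
import librosa

# Load the audio slice as a mono waveform at 16 kHz (hypothetical file path).
waveform, sample_rate = librosa.load("audio_slice.wav", sr=16_000)

# 40 MFCC coefficients per analysis frame, a common choice for speech features.
mfcc = librosa.feature.mfcc(y=waveform, sr=sample_rate, n_mfcc=40)
print(mfcc.shape)   # (40, number_of_analysis_frames)
```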
The audio features and the mouth features have been described above by way of example; driving the video based on them is described below. For example, the image of the 0th video frame in the target slice is adjusted based on the audio features of the 0th to (Q-1)-th audio frames.
In some embodiments, referring to fig. 12, adjusting the image of the P-th video frame in the target slice based at least on the audio features of the (Q×P)-th audio frame through the (Q×(P+1)-1)-th audio frame to obtain the driven P-th video frame may include the following operations.
First, audio features are extracted from the (Q×P)-th audio frame through the (Q×(P+1)-1)-th audio frame, and mouth features of the target person are extracted from the P-th video frame. For example, the audio features of the 0th to (Q-1)-th audio frames are extracted, and the mouth features of the target person are extracted from the 0th video frame.
The audio features and the mouth features are then processed using a mouth image generation model to obtain the mouth image corresponding to the (Q×P)-th audio frame through the (Q×(P+1)-1)-th audio frame. For example, the mouth image corresponding to the 0th to (Q-1)-th audio frames is obtained.
Next, the mouth image of the target person in the P-th video frame is replaced by the generated mouth image, resulting in the driven P-th video frame. For example, the mouth image of the 0th video frame a0 in the spliced video slice is replaced with the mouth image generated based on the 0th to (Q-1)-th audio frames b0 to b(Q-1).
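The per-frame driving loop described above can be outlined as follows; generate_mouth_image stands in for the mouth image generation model, paste_mouth for the compositing step, and q for the number of audio frames per video frame. All of these names are assumptions introduced for illustration.

```python
def drive_target_slice(video_frames, audio_frames, q, generate_mouth_image, paste_mouth):
    """Replace the mouth region of each video frame using its Q audio frames.

    video_frames: list of background frames of the target slice.
    audio_frames: list of audio frames (or their features).
    q: number of audio frames corresponding to one video frame.
    generate_mouth_image(audio_group, frame) -> mouth image for that frame.
    paste_mouth(frame, mouth_image) -> frame with the mouth region replaced.
    """
    driven = []
    for p, frame in enumerate(video_frames):
        audio_group = audio_frames[q * p : q * (p + 1)]    # Q x P .. Q x (P+1) - 1
        mouth = generate_mouth_image(audio_group, frame)
        driven.append(paste_mouth(frame, mouth))
    return driven
```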
In some embodiments, the mouth image generation model may include a feature fusion module and a decoder.
The feature fusion module is used for fusing the audio features and the mouth features to obtain fusion features. The decoder is connected with the feature fusion module and used for decoding the fusion features to obtain a mouth image.
For example, the mouth image generation model may be a neural network including an audio encoder, an image encoder, and an image decoding generator.
For example, the spectrogram of the audio slice is input to the audio encoder, and audio features are extracted by convolution layers. At the same time, the images of the video frames corresponding to the audio slice are input to the image encoder, and image features are extracted by convolution layers. The extracted audio and image features are then input to the decoder, which generates a sequence of lip images synchronized with the audio slice. The resolution of the lip images includes, but is not limited to, 96×96, 128×128, 256×256, 512×512, etc., and can be set according to the user's needs.
In addition, to generate lip images that more closely conform to the target person, features extracted based on rules, such as face lip keypoint contours, head contours, and backgrounds, etc., may also be included in the decoder input. By adding the features extracted based on the rules, the generated lip images can be controlled more finely, and more controllable high-definition images can be generated.
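A compact sketch of such an audio-encoder / image-encoder / decoder arrangement is shown below in PyTorch; the layer sizes, the 96×96 output resolution, the mel-spectrogram input shape, and fusion by concatenation are assumptions for illustration, not the patent's architecture.

```python
import torch
import torch.nn as nn

class MouthImageGenerator(nn.Module):
    """Audio encoder + image encoder + decoder that outputs a 96x96 mouth image."""

    def __init__(self):
        super().__init__()
        # Audio branch: encode a (1, 80, 16) mel-spectrogram patch into a vector.
        self.audio_encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128),
        )
        # Image branch: encode a (3, 96, 96) face/mouth crop into a vector.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128),
        )
        # Decoder: fuse the two feature vectors and generate the mouth image.
        self.decoder = nn.Sequential(
            nn.Linear(256, 64 * 12 * 12), nn.ReLU(),
            nn.Unflatten(1, (64, 12, 12)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),   # 24x24
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),   # 48x48
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(), # 96x96
        )

    def forward(self, spectrogram, face_crop):
        fused = torch.cat(
            [self.audio_encoder(spectrogram), self.image_encoder(face_crop)], dim=1
        )
        return self.decoder(fused)

model = MouthImageGenerator()
out = model(torch.randn(2, 1, 80, 16), torch.randn(2, 3, 96, 96))
print(out.shape)   # torch.Size([2, 3, 96, 96])
```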
In some embodiments, referring to fig. 12, the background image of the video frame may be preprocessed to remove the mouth region from the background image. This reduces the risk that the original mouth in the background image and the generated mouth image are displayed simultaneously in the driven video frame, improving fault tolerance.
In some embodiments, the above-described method may also train the mouth image generation model as follows. Taking the example that the mouth image generation model is a neural network, the method may include the following operations.
First, a training data set is obtained, the training data in the training data set comprising training audio tiles, training video tiles, and target video tiles.
Then, for the j-th audio frame in the training audio slice and the k-th video frame in the training video slice, audio features are extracted from the j-th audio frame and mouth features are extracted from the k-th video frame. There is a correspondence between j and k, namely k = j × f1/f2 rounded down, or k = j × f1/f2 rounded up, where f1 is the frame rate of the video slice, f2 is the frame rate of the audio slice, and j and k are integers greater than or equal to 0.
Then, the audio features and the mouth features are input into the mouth image generation model, and the model parameters of the mouth image generation model are adjusted so that the difference between the mouth image output by the model and the mouth image in the k-th video frame of the target video slice is smaller than a difference threshold. For example, the model parameters are obtained by minimizing a loss function. The model parameters include, but are not limited to, weights and biases.
Specifically, the mouth image generation model learns the mapping relation between the audio characteristics and the video characteristics in the training process, so that the generated face lip image sequence is smoother and more natural, and the generation requirements of different video scenes and speaking characters are met.
For example, the discrimination network performs lip-sync discrimination according to the audio slice and the lip image sequence to obtain a lip-sync discrimination value, and optimizes the mouth image generation model according to the lip-sync discrimination value. For example, the discrimination network obtains image authenticity probability values according to the lip image sequence and a plurality of frame images in the target video clips, and optimizes the mouth image generation model according to the image authenticity probability values.
In some embodiments, the discrimination network may be divided into a lip-sync discrimination network and an image quality discrimination network. The lip synchronization judging network is used for detecting the lip synchronization generated by the mouth image generating model in the training process and giving out a lip synchronization judging value, so that the mouth image generating model can be conveniently trained to generate more truly synchronous lip images. The image quality judging network is used for detecting the image quality in the training process and outputting a generated reality probability value between the mouth image and the target image, so that the mouth image generating model is convenient to train to generate a higher-definition real image.
For example, the lip-sync discrimination network may be a pre-trained network whose input is an audio clip and the correspondingly generated lip images and whose output is the degree of synchronization between each lip image and the corresponding audio clip; the discriminator gives a lip-sync discrimination value, which is used to optimize the mouth image generation model so that it generates lip images better synchronized with the sound. The image quality discrimination network is trained together with the mouth image generation model; its input is the generated lip images and the lip images of the video frames in the target video slice, and its output is an image authenticity probability value. The image quality discrimination network judges whether the quality of the generated images is good, so that during training the mouth image generation model learns to generate more realistic lip images.
In some embodiments, the input to the mouth image generation model may also include the rotation angle of the face about the vertical (plumb) line. The target person in the background video may perform motions such as turning or nodding the head. If the generated mouth images all correspond to lip images at a single angle, they may not suit these particular scenes, for example causing a mismatch between the mouth image and the face image. Adding a rotation angle to the input of the mouth image generation model makes it possible to obtain a lip image for that rotation angle, improving the fidelity of the synthesized video.
In a specific embodiment, the audio slice A used for driving and the target slice V to be driven are fed into a voice driving algorithm f: V × A → V. The voice driving algorithm changes the mouth shape of the target object in each video frame of the target slice according to the content of the audio frames, so that the mouth shapes in the output driven video slice are more consistent with the audio slice.
In some embodiments, after the driven target slice is obtained, it may be used for a number of functions such as teaching, guidance, and the like, which are not listed here.
Fig. 13 is a schematic diagram of outputting a driven target slice in an embodiment of the present application.
Referring to fig. 13, a second material library is first constructed. Then, a first material library of smooth complementary frame motion video slices is generated from the second material library using a frame interpolation algorithm. Next, the audio slice is analyzed, the splicing positions (time periods) of the complementary frame motion video slices are annotated, and the complementary frame motion video slices are spliced according to the annotations to obtain the target slice; see fig. 10. Finally, voice driving is performed on the spliced target slice so that the mouth shape is aligned with the speech content, yielding the driven target slice. In an application scenario driven by computer vision and speech, a smooth video with limb, facial, and mouth-shape motions is thus generated according to the audio content.
In the embodiment of the application, a second material library is established, and the second material library contains a plurality of action video fragments, so that a user can splice the action video fragments of the material library according to the content of the audio fragments to obtain videos containing character limbs, facial actions and the like which accord with audio semantics. Because the action material library is a truly collected video, the action material library can provide the most realistic action state of the characters, so that the characters in the synthesized video fragments are more realistic and natural.
In addition, frame interpolation is applied to the collected motion video slices, so that each motion video slice transitions smoothly within the target slice. Complementary frame video slices with smooth character motion can be generated by frame interpolation, and they prevent abrupt changes in motion or position, stuttering, and the like in the target slice.
In addition, voice driving is performed on the obtained smooth target slice: a preset audio slice drives the smooth target slice through the voice driving algorithm, achieving audio-lip alignment.
In addition, by annotating the audio in advance and splicing the video in advance, the match between the motions of the video to be generated and the audio content can be previewed visually beforehand, making video generation more efficient.
Fig. 14 is a schematic structural diagram of an entity apparatus for performing the video generating method in the embodiment of the present application.
Referring to fig. 14, a schematic diagram of an electronic device 1400 is shown. The electronic device 1400 in the embodiment of the present application can implement the operations corresponding to the video generation method performed in the embodiment corresponding to fig. 3 described above. The functions performed by the electronic device 1400 may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above, and the modules may be software and/or hardware. The electronic device 1400 may include a processing module and a storage module; for the functional implementation of the processing module, reference may be made to the operations performed in the embodiment corresponding to fig. 3, which are not repeated here.
Specifically, the electronic device 1400 includes: a memory 1410, and at least one processor 1420. In addition, the electronic device 1400 may further include an input/output module configured to obtain at least an audio slice and at least two video slices, where a playing duration of the audio slice is at least longer than a playing duration of each of the at least two video slices.
Wherein the memory 1410 is used for storing a computer program, and the processor 1420 is used for calling the computer program stored in the memory 1410 to perform the method as described above.
In some embodiments, the processor 1420 is further configured to obtain a target slice according to at least two video slices, where a play time period of the target slice is greater than or equal to a play time period of the audio slice; driving the target fragments by utilizing the audio fragments to obtain driven target fragments; the method comprises the steps that target objects in video clips correspond to at least one specific action, target objects in at least two video clips correspond to different specific actions, target objects in the target clips correspond to at least two specific actions, a first video clip has no frame which jumps in a first playing period and a second video clip has no frame which jumps in a second playing period, the starting playing time of the first playing period of the first video clip is later than the playing starting time of the first video clip, the ending playing time of the second playing period of the second video clip is earlier than the ending playing time of the second video clip, and the ending playing time of the first video clip in the first playing period is identical to the starting playing time of the second video clip in the second playing period.
Another aspect of the present application also provides a server.
Fig. 15 is a schematic structural diagram of a server according to an embodiment of the present application.
Referring to fig. 15, the server 150 may vary considerably in configuration or performance and may include one or more central processing units (collectively, central processing units, abbreviated as CPU) 1522 (e.g., one or more processors) and memory 1532, one or more storage media 1530 (e.g., one or more mass storage devices) storing applications 1542 or data 1544. Wherein the memory 1532 and the storage medium 1530 may be transitory or persistent storage. The program stored on the storage medium 1530 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, a central processor 1522 may be provided in communication with the storage medium 1530, executing a series of instruction operations on the server 1520 in the storage medium 1530.
The server 1520 may also include one or more power supplies 1526, one or more wired or wireless network interfaces 1550, one or more input/output interfaces 1558, and/or one or more operating systems 1541, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like.
The steps performed by the server in the above embodiments may be based on the structure of the server 1520 shown in fig. 15. The steps performed by the electronic device 1400 shown in fig. 14 in the above-described embodiments, for example, may be based on the server structure shown in fig. 15. For example, the processor 1522 performs the following operations by calling instructions in the memory 1532.
At least two video slices are obtained through input output interface 1558.
The processor 1522 obtains a target slice from at least two video slices. The method comprises the steps that a target object in a video slice corresponds to at least one specific action, the target object in the target slice corresponds to at least two specific actions, a first video slice has no frame hopped in a first playing period and a second video slice has no frame hopped in a second playing period, the starting playing time of the first playing period is later than the playing starting time of the first video slice, the ending playing time of the second playing period is earlier than the ending playing time of the second video slice, and the ending playing time of the first playing period is identical to the starting playing time of the second playing period.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments. It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and modules described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of modules is merely a logical function division, and there may be additional divisions of actual implementation, e.g., multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When a computer program is loaded onto and executed by a computer, the processes or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). Computer readable storage media can be any available media that can be stored by a computer or data storage devices such as servers, data centers, etc. that contain an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk (SSD)), etc.
The foregoing describes in detail the technical solution provided by the embodiments of the present application, in which specific examples are applied to illustrate the principles and implementations of the embodiments of the present application, where the foregoing description of the embodiments is only used to help understand the methods and core ideas of the embodiments of the present application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope according to the ideas of the embodiments of the present application, the present disclosure should not be construed as limiting the embodiments of the present application in summary.

Claims (23)

1. A method of synthesizing video, the method comprising:
obtaining at least two video clips;
splicing the at least two video clips based on a preset sequence to obtain a spliced target clip; the preset sequence comprises a positive sequence or a reverse sequence;
the method comprises the steps that target objects in video clips correspond to at least one specific action, target objects in at least two video clips correspond to a reference action and different specific actions, the target objects in the target clips correspond to at least two specific actions, a first video clip does not jump frames in a first playing period and a second video clip does not jump frames in a second playing period, the playing time of the first video clip at the end of the first playing period is identical to the playing time of the second video clip at the beginning of the second playing period, and the target objects in the video clips correspond to the reference actions in the first playing period and the second playing period;
When the target object in the at least two video clips only corresponds to the reference action, splicing the at least two video clips based on the reverse order;
and/or the number of the groups of groups,
and when the end video frame of the first video slice and the end video frame of the second video slice are the same, or the start video frame of the first video slice and the start video frame of the second video slice are the same, splicing the first video slice and the second video slice based on the reverse order.
2. The method of claim 1, wherein the at least two video slices are obtained by:
at least obtaining at least two complementary frame action video clips Vij from a first material library, wherein the first material library comprises N complementary frame action video clips Vij, and each complementary frame action video clip Vij comprises an action video clip Vi and at least one complementary frame video clip Bij; the frame-supplementing video slicing Bij enables the motion video slicing Vi to have no frame hopped in the first playing period, and the frame-supplementing video slicing Bij enables the motion video slicing Vj to have no frame hopped in the second playing period, wherein N is an integer greater than or equal to 2, and i and j are integers greater than or equal to 1 respectively;
And taking the at least two frame-supplementing action video clips Vij as the at least two video clips.
3. The method according to claim 2, wherein the target object in the motion video clip Vi corresponds to a reference motion and at least one specific motion, and the target object in the motion video clip Vi corresponds to the reference motion in the first playback period and the second playback period.
4. The method of claim 2, wherein the first library of material is constructed by:
obtaining n motion video slices Vi, i being less than or equal to n, and N = n²;
And generating a frame-supplementing motion video slice Vij of the motion video slice Vi for the motion video slice Vj for each of the n motion video slices Vi to be added into the first material library, wherein i and j are the same or different.
5. The method of claim 4, wherein generating the motion video slice Vi for the complementary frame motion video slice Vij of the motion video slice Vj comprises:
generating a complementary frame video slice Bij of the action video slice Vi aiming at the action video slice Vj;
the frame-complement action video slice Vij is obtained by one of the following: setting the frame-supplementing video slicing Bij at the last video frame of the action video slicing Vi to obtain the frame-supplementing action video slicing Vij; or, setting the frame-supplementing video slice Bij before the first video frame of the action video slice Vj to obtain the frame-supplementing action video slice Vij; or, setting the first part of the complementary frame video slice Bij after the last video frame of the action video slice Vi, and setting the second part of the complementary frame video slice Bij before the first video frame of the action video slice Vj, so as to obtain the complementary frame action video slice Vij.
6. The method of claim 5, wherein generating the complementary frame video slice Bij of the motion video slice Vi for the motion video slice Vj comprises:
obtaining the last video frame of the action video slice Vi and the first video frame of the action video slice Vj;
obtaining a plurality of supplementary video frames between the last video frame of the action video clip Vi and the first video frame of the action video clip Vj through a frame supplementing algorithm, so that the last video frame of the action video clip Vi, the first video frame of the action video clip Vj and no jump frame exist between the last video frame of the action video clip Vi and the first video frame of the action video clip Vj;
and combining the plurality of supplementary video frames according to a preset frame rate to obtain the supplementary frame video slicing Bij, wherein the playing time length of the supplementary frame video slicing Bij is a designated time length.
7. The method of claim 1, wherein the at least two video slices are obtained by:
obtaining at least two motion video clips Vi from a second material library, wherein the second material library comprises n motion video clips Vi, a target object in each motion video clip Vi corresponds to a reference motion and at least one specific motion, and for each of at least part of the motion video clips Vi, the target object of the motion video clip Vi corresponds to the reference motion in the first playing period and the second playing period, the reference motion and the specific motion are the same or different, n is an integer greater than or equal to 2, and i is an integer greater than or equal to 1;
And taking the at least two action video clips Vi as the at least two video clips.
8. The method of claim 1, wherein the splicing the at least two video clips based on the preset order results in a spliced target clip, comprising:
for each video slice in the at least two video slices, performing frame extraction on the video slice to obtain a video frame sequence;
sequencing at least two video frame sequences according to a preset video slicing playing sequence to obtain a combined video frame sequence;
and combining the combined video frame sequences according to a preset frame rate to generate the target fragments.
9. The method according to claim 2, wherein when the values of i and j in the complementary frame motion video slices Vij are the same, the video frame sequences in two adjacent complementary frame motion video slices Vij are arranged in reverse order to each other.
10. The method of any of claims 1 to 9, wherein the video frames of the first video clip during a first playback period are different from the video frames of the second video clip during the second playback period.
11. The method according to any one of claims 1 to 9, wherein:
The first similarity is greater than the second similarity, wherein the first similarity is the similarity between a subsequent video frame of the first video clip in the first playing period and a first video frame of the second video clip, the second similarity is the similarity between a preceding video frame of the first video clip in the first playing period and a first video frame of the second video clip, and the plurality of video frames of the first video clip and the second video clip are each arranged in time sequence, the preceding video frame being arranged before the subsequent video frame; or alternatively
The third similarity is greater than a fourth similarity, wherein the third similarity is a similarity between a preceding video frame of the second video clip and a last video frame of the first video clip, the preceding video frame is arranged in time sequence, and the fourth similarity is a similarity between a following video frame of the second video clip and a last video frame of the first video clip, the plurality of video frames of the first video clip and the second video clip are each arranged in time sequence, and the preceding video frame is arranged before the following video frame.
12. A video processing apparatus, the apparatus comprising:
at least one processor, memory, and input-output module;
the input/output module is at least used for obtaining at least two video fragments, the memory is used for storing a computer program, and the processor is used for calling the computer program stored in the memory to execute:
obtaining at least two video clips;
splicing the at least two video clips based on a preset sequence to obtain a spliced target clip; the preset sequence comprises a positive sequence or a reverse sequence;
the method comprises the steps that target objects in video clips correspond to at least one specific action, target objects in at least two video clips correspond to a reference action and different specific actions, the target objects in the target clips correspond to at least two specific actions, a first video clip does not jump frames in a first playing period and a second video clip does not jump frames in a second playing period, the playing time of the first video clip at the end of the first playing period is identical to the playing time of the second video clip at the beginning of the second playing period, and the target objects in the video clips correspond to the reference actions in the first playing period and the second playing period;
When the target object in the at least two video clips only corresponds to the reference action, splicing the at least two video clips based on the reverse order;
and/or the number of the groups of groups,
and when the end video frame of the first video slice and the end video frame of the second video slice are the same, or the start video frame of the first video slice and the start video frame of the second video slice are the same, splicing the first video slice and the second video slice based on the reverse order.
13. The apparatus of claim 12, wherein the processor is configured to:
at least obtaining at least two complementary frame action video clips Vij from a first material library, wherein the first material library comprises N complementary frame action video clips Vij, and each complementary frame action video clip Vij comprises an action video clip Vi and at least one complementary frame video clip Bij; the frame-supplementing video slicing Bij enables the motion video slicing Vi to have no frame hopped in the first playing period, and the frame-supplementing video slicing Bij enables the motion video slicing Vj to have no frame hopped in the second playing period, wherein N is an integer greater than or equal to 2, and i and j are integers greater than or equal to 1 respectively;
And taking the at least two frame-supplementing action video clips Vij as the at least two video clips.
14. The apparatus of claim 13, wherein a target object in the motion video clip Vi corresponds to a reference motion and at least one specific motion, the target object of the motion video clip Vi corresponding to the reference motion during the first playback period and the second playback period.
15. The apparatus of claim 13, wherein the processor is configured to:
obtaining n motion video clips Vi, where i is less than or equal to n and N = n²;
and, for each of the n motion video clips Vi, generating a frame-supplemented motion video clip Vij of the motion video clip Vi for the motion video clip Vj and adding it to the first material library, wherein i and j are the same or different.
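A minimal sketch, assuming the motion clips are held in a list and a build_vij helper (a placeholder for the construction recited in claims 16 and 17) is available, of how the N = n² library entries of claim 15 could be generated:

```python
def build_library(motion_clips, build_vij):
    """Populate the first material library with one frame-supplemented clip
    Vij per ordered pair (i, j) of the n motion clips, giving N = n * n
    entries; i and j may be equal or different."""
    n = len(motion_clips)
    return {
        (i, j): build_vij(motion_clips[i], motion_clips[j])
        for i in range(n)
        for j in range(n)
    }
```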
16. The apparatus of claim 15, wherein the processor is configured to:
generating a frame-supplementing video clip Bij of the motion video clip Vi for the motion video clip Vj;
and obtaining the frame-supplemented motion video clip Vij in one of the following ways: setting the frame-supplementing video clip Bij after the last video frame of the motion video clip Vi to obtain the frame-supplemented motion video clip Vij; or setting the frame-supplementing video clip Bij before the first video frame of the motion video clip Vj to obtain the frame-supplemented motion video clip Vij; or setting a first part of the frame-supplementing video clip Bij after the last video frame of the motion video clip Vi and a second part of the frame-supplementing video clip Bij before the first video frame of the motion video clip Vj to obtain the frame-supplemented motion video clip Vij.
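One possible reading of the three placement options in claim 16 is sketched below; the mode flag, the even split of Bij, and the assumption that all three options yield the same on-screen order Vi → Bij → Vj are illustrative interpretations, not statements of the claim.

```python
def place_bij(vi_frames, vj_frames, bij_frames, mode="after_vi"):
    """Illustrative reading of the three placement options in claim 16.

    Under this reading the options differ only in whether the bridging
    frames of Bij are stored with Vi, with Vj, or split between them.
    """
    if mode == "after_vi":        # Bij after the last frame of Vi
        return vi_frames + bij_frames, vj_frames
    if mode == "before_vj":       # Bij before the first frame of Vj
        return vi_frames, bij_frames + vj_frames
    k = len(bij_frames) // 2      # "split": first part after Vi, second before Vj
    return vi_frames + bij_frames[:k], bij_frames[k:] + vj_frames
```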
17. The apparatus of claim 16, wherein the processor is configured to:
obtaining the last video frame of the motion video clip Vi and the first video frame of the motion video clip Vj;
obtaining, through a frame-supplementing algorithm, a plurality of supplementary video frames between the last video frame of the motion video clip Vi and the first video frame of the motion video clip Vj, so that no frame jump exists between the last video frame of the motion video clip Vi and the first video frame of the motion video clip Vj;
and combining the plurality of supplementary video frames according to a preset frame rate to obtain the frame-supplementing video clip Bij, wherein the playing duration of the frame-supplementing video clip Bij is a specified duration.
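Claim 17 does not name a particular frame-supplementing algorithm. Purely as an illustration, the sketch below stands in simple linear blending (a cross-dissolve) between the two boundary frames, with the number of supplementary frames derived from the preset frame rate and the specified playing duration; any learned interpolation model could take its place.

```python
import numpy as np

def build_bij(last_frame_vi, first_frame_vj, fps=25, duration_s=0.4):
    """Generate the supplementary frames of Bij between Vi's last frame and
    Vj's first frame (claim 17). Linear blending is an assumed stand-in for
    the unspecified frame-supplementing algorithm."""
    num_frames = max(1, int(round(fps * duration_s)))   # fixed playing duration
    a = last_frame_vi.astype(np.float32)
    b = first_frame_vj.astype(np.float32)
    frames = []
    for k in range(1, num_frames + 1):
        t = k / (num_frames + 1)                 # strictly between the two frames
        frames.append(((1.0 - t) * a + t * b).astype(np.uint8))
    return frames                                # combined at the preset frame rate
```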
18. The apparatus of claim 12, wherein the processor is configured to:
obtaining at least two motion video clips Vi from a second material library, wherein the second material library comprises n motion video clips Vi, a target object in each motion video clip Vi corresponds to a reference motion and at least one specific motion, and for each of at least some of the motion video clips Vi, the target object of the motion video clip Vi corresponds to the reference motion in the first playing period and the second playing period, the reference motion and the specific motion being the same or different, n being an integer greater than or equal to 2 and i being an integer greater than or equal to 1;
and taking the at least two motion video clips Vi as the at least two video clips.
19. The apparatus of claim 12, wherein the processor is configured to:
for each of the at least two video clips, extracting frames from the video clip to obtain a video frame sequence;
sorting the at least two video frame sequences according to a preset playing order of the video clips to obtain a combined video frame sequence;
and combining the combined video frame sequence according to a preset frame rate to generate the target clip.
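As an illustration of claim 19, the following Python sketch uses OpenCV (an assumed choice; the claim names no library): each video clip is decoded into a frame sequence, the sequences are concatenated in the preset playback order, and the combined sequence is re-encoded at a preset frame rate.

```python
import cv2

def extract_frames(path):
    """Decode one video clip into an ordered list of frames (claim 19, step 1)."""
    cap = cv2.VideoCapture(path)
    frames = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append(frame)
    cap.release()
    return frames

def compose_target_clip(clip_paths, out_path, fps=25):
    """Concatenate the frame sequences in the preset playback order and
    re-encode them at a preset frame rate (claim 19, steps 2 and 3)."""
    combined = []
    for path in clip_paths:                 # clip_paths already in playback order
        combined.extend(extract_frames(path))
    height, width = combined[0].shape[:2]
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"),
                             fps, (width, height))
    for frame in combined:
        writer.write(frame)
    writer.release()
```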
20. The apparatus of claim 13, wherein, when the values of i and j in the frame-supplemented motion video clips Vij are the same, the video frame sequences in two adjacent frame-supplemented motion video clips Vij are arranged in reverse order to each other.
21. The apparatus of any of claims 12 to 20, wherein the video frames of the first video clip during the first playing period are different from the video frames of the second video clip during the second playing period.
22. The apparatus according to any one of claims 12 to 20, wherein:
the first similarity is greater than the second similarity, wherein the first similarity is the similarity between a subsequent video frame of the first video clip in the first playing period and a first video frame of the second video clip, the second similarity is the similarity between a preceding video frame of the first video clip in the first playing period and a first video frame of the second video clip, and the plurality of video frames of the first video clip and the second video clip are each arranged in time sequence, the preceding video frame being arranged before the subsequent video frame; or alternatively
The third similarity is greater than a fourth similarity, wherein the third similarity is a similarity between a preceding video frame of the second video clip and a last video frame of the first video clip, the preceding video frame is arranged in time sequence, and the fourth similarity is a similarity between a following video frame of the second video clip and a last video frame of the first video clip, the plurality of video frames of the first video clip and the second video clip are each arranged in time sequence, and the preceding video frame is arranged before the following video frame.
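Claim 22 leaves the similarity measure unspecified; the sketch below assumes a simple inverse mean-squared-error score and checks the recited ordering of the first and second similarities. The function names and index parameters are illustrative assumptions.

```python
import numpy as np

def frame_similarity(a, b):
    """Assumed similarity measure: 1 / (1 + mean squared error) over pixels."""
    mse = np.mean((a.astype(np.float32) - b.astype(np.float32)) ** 2)
    return 1.0 / (1.0 + mse)

def satisfies_claim_22(clip_a, clip_b, early_idx, late_idx):
    """Check one reading of claim 22: a later frame of clip_a (closer to the
    splice point) is more similar to clip_b's first frame than an earlier one."""
    s_late = frame_similarity(clip_a[late_idx], clip_b[0])    # first similarity
    s_early = frame_similarity(clip_a[early_idx], clip_b[0])  # second similarity
    return s_late > s_early
```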
23. A computer readable storage medium, characterized in that it comprises instructions which, when run on a computer, cause the computer to perform the method according to any of claims 1-11.
CN202310790573.8A 2022-04-28 2022-04-28 Video synthesis method, related device and storage medium Active CN116801043B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310790573.8A CN116801043B (en) 2022-04-28 2022-04-28 Video synthesis method, related device and storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210461854.4A CN114900733B (en) 2022-04-28 2022-04-28 Video generation method, related device and storage medium
CN202310790573.8A CN116801043B (en) 2022-04-28 2022-04-28 Video synthesis method, related device and storage medium

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN202210461854.4A Division CN114900733B (en) 2022-04-28 2022-04-28 Video generation method, related device and storage medium

Publications (2)

Publication Number Publication Date
CN116801043A CN116801043A (en) 2023-09-22
CN116801043B true CN116801043B (en) 2024-03-19

Family

ID=82718621

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310790573.8A Active CN116801043B (en) 2022-04-28 2022-04-28 Video synthesis method, related device and storage medium
CN202210461854.4A Active CN114900733B (en) 2022-04-28 2022-04-28 Video generation method, related device and storage medium

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202210461854.4A Active CN114900733B (en) 2022-04-28 2022-04-28 Video generation method, related device and storage medium

Country Status (1)

Country Link
CN (2) CN116801043B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116248811B (en) * 2022-12-09 2023-12-05 北京生数科技有限公司 Video processing method, device and storage medium
CN116312612B (en) * 2023-02-02 2024-04-16 北京甲板智慧科技有限公司 Audio processing method and device based on deep learning
CN116320222B (en) * 2023-03-24 2024-01-30 北京生数科技有限公司 Audio processing method, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190114276A (en) * 2018-03-29 2019-10-10 한국전자통신연구원 Method for processing composite video and apparatus for the same
CN111050187A (en) * 2019-12-09 2020-04-21 腾讯科技(深圳)有限公司 Virtual video processing method, device and storage medium
CN111970560A (en) * 2020-07-09 2020-11-20 北京百度网讯科技有限公司 Video acquisition method and device, electronic equipment and storage medium
CN113301409A (en) * 2021-05-21 2021-08-24 北京大米科技有限公司 Video synthesis method and device, electronic equipment and readable storage medium
CN113821148A (en) * 2020-06-19 2021-12-21 阿里巴巴集团控股有限公司 Video generation method and device, electronic equipment and computer storage medium
CN114363712A (en) * 2022-01-13 2022-04-15 深圳迪乐普智能科技有限公司 AI digital person video generation method, device and equipment based on templated editing

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10204656B1 (en) * 2017-07-27 2019-02-12 Adobe Inc. Video processing architectures which provide looping video
CN109118562A (en) * 2018-08-31 2019-01-01 百度在线网络技术(北京)有限公司 Explanation video creating method, device and the terminal of virtual image
CN111432233B (en) * 2020-03-20 2022-07-19 北京字节跳动网络技术有限公司 Method, apparatus, device and medium for generating video
CN111741326B (en) * 2020-06-30 2023-08-18 腾讯科技(深圳)有限公司 Video synthesis method, device, equipment and storage medium
CN112188304B (en) * 2020-09-28 2022-11-15 广州酷狗计算机科技有限公司 Video generation method, device, terminal and storage medium
CN113132797A (en) * 2021-04-22 2021-07-16 北京房江湖科技有限公司 Video generation method and device, computer-readable storage medium and electronic equipment
CN113903067A (en) * 2021-10-18 2022-01-07 深圳追一科技有限公司 Virtual object video generation method, device, equipment and medium


Also Published As

Publication number Publication date
CN114900733A (en) 2022-08-12
CN114900733B (en) 2023-07-21
CN116801043A (en) 2023-09-22

Similar Documents

Publication Publication Date Title
CN116801043B (en) Video synthesis method, related device and storage medium
CN112562721B (en) Video translation method, system, device and storage medium
US5884267A (en) Automated speech alignment for image synthesis
US7676372B1 (en) Prosthetic hearing device that transforms a detected speech into a speech of a speech form assistive in understanding the semantic meaning in the detected speech
US11847726B2 (en) Method for outputting blend shape value, storage medium, and electronic device
CN113077537B (en) Video generation method, storage medium and device
CN110675886A (en) Audio signal processing method, audio signal processing device, electronic equipment and storage medium
Friedland et al. Multimedia computing
Alexanderson et al. Animated Lombard speech: Motion capture, facial animation and visual intelligibility of speech produced in adverse conditions
CN114845160B (en) Voice-driven video processing method, related device and storage medium
CN113395569B (en) Video generation method and device
CN112383721B (en) Method, apparatus, device and medium for generating video
CN116248811B (en) Video processing method, device and storage medium
CN112381926A (en) Method and apparatus for generating video
JP3569278B1 (en) Pronunciation learning support method, learner terminal, processing program, and recording medium storing the program
JP2020140326A (en) Content generation system and content generation method
CN115529500A (en) Method and device for generating dynamic image
CN115223224A (en) Digital human speaking video generation method, system, terminal device and medium
CN116561294A (en) Sign language video generation method and device, computer equipment and storage medium
Pan et al. VOCAL: Vowel and Consonant Layering for Expressive Animator-Centric Singing Animation
CN113079327A (en) Video generation method and device, storage medium and electronic equipment
US20180108356A1 (en) Voice processing apparatus, wearable apparatus, mobile terminal, and voice processing method
CN112383722B (en) Method and apparatus for generating video
Lavagetto Multimedia Telephone for Hearing-Impaired People
CN117115318B (en) Method and device for synthesizing mouth-shaped animation and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant