CN116980543A - Video generation method, device, storage medium and computer equipment - Google Patents

Video generation method, device, storage medium and computer equipment

Info

Publication number
CN116980543A
CN116980543A (application CN202211655729.3A)
Authority
CN
China
Prior art keywords
dance
action
segment
sequence
music
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211655729.3A
Other languages
Chinese (zh)
Inventor
杨司琪
王智圣
邱东洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202211655729.3A priority Critical patent/CN116980543A/en
Publication of CN116980543A publication Critical patent/CN116980543A/en
Pending legal-status Critical Current

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 5/00 Details of television systems
    • H04N 5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N 5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N 5/265 Mixing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44016 Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip

Abstract

The application discloses a video generation method, which comprises the following steps: obtaining a segment sequence of target music, wherein the segment sequence comprises a plurality of music segments; performing a sequence search on the segment sequence based on a dance action graph to obtain a dance action sequence composed of dance segments matched with each music segment; generating a plurality of transition sequences corresponding to the dance action sequence according to key frames corresponding to each pair of adjacent dance segments in the dance action sequence; and inputting each transition sequence into an action completion model to perform action completion, so as to obtain a dance video corresponding to the target music. The action completion model is obtained based on action distance loss training. By applying artificial intelligence technology, the transition sequence of each pair of adjacent dance segments in the dance action sequence is completed by the action completion model, so that the action connection between dance segments in the dance action sequence is more accurate and natural, the fluency of dance motion synthesis is improved, and the quality of dance video generation is further improved.

Description

Video generation method, device, storage medium and computer equipment
Technical Field
The present application relates to the field of computer vision, and more particularly, to a video generating method, apparatus, storage medium, and computer device.
Background
Dance motion synthesis refers to generating, for a given piece of music, a dance motion video whose style matches the style of that music. Dance synthesis has become very popular in the gaming and video industries. For example, a game developer may use a dance motion synthesis system to generate jazz, anime-style, street dance and other types of dance animation according to the music style, producing adapted dance motion resources for different game projects.
Dance motion synthesis schemes in the related art mainly include the following: in methods based on an action graph, the resulting dance video may exhibit abrupt motion changes, foot sliding and similar artifacts; in methods based on deep learning, the dance motions in the video obtained through end-to-end network learning have low accuracy. In summary, current dance motion synthesis is of poor quality.
Disclosure of Invention
The embodiments of the present application provide a video generation method, a video generation device, a storage medium and a computer device, which can improve the fluency of dance motion synthesis and thereby improve the quality of video generation.
In one aspect, an embodiment of the present application provides a video generation method, including: acquiring a segment sequence of target music, wherein the segment sequence comprises a plurality of music segments; performing a sequence search on the segment sequence based on a dance action graph to obtain a dance action sequence composed of dance segments matched with each music segment; generating a plurality of transition sequences corresponding to the dance action sequence according to key frames corresponding to each pair of adjacent dance segments in the dance action sequence; and inputting each transition sequence into an action completion model to perform action completion, so as to obtain a dance video corresponding to the target music; the action completion model is obtained based on action distance loss training, and the action distance loss is determined by the Euclidean distance between a predicted action sequence and a label action sequence.
In another aspect, an embodiment of the present application further provides a video generation device, comprising: a music segment acquisition module, configured to acquire a segment sequence of target music, wherein the segment sequence comprises a plurality of music segments; an action sequence search module, configured to perform a sequence search on the segment sequence based on a dance action graph to obtain a dance action sequence composed of dance segments matched with each music segment; a transition sequence generation module, configured to generate a plurality of transition sequences corresponding to the dance action sequence according to key frames corresponding to each pair of adjacent dance segments in the dance action sequence; and a dance video generation module, configured to input each transition sequence into an action completion model to perform action completion, so as to obtain a dance video corresponding to the target music; the action completion model is obtained based on action distance loss training, and the action distance loss is determined by the Euclidean distance between a predicted action sequence and a label action sequence.
In another aspect, an embodiment of the present application further provides a computer readable storage medium storing a computer program, where the video generating method described above is performed when the computer program is executed by a processor.
On the other hand, the embodiment of the application also provides a computer device, which comprises a processor and a memory, wherein the memory stores a computer program, and the computer program executes the video generation method when being called by the processor.
In another aspect, embodiments of the present application also provide a computer program product comprising a computer program stored in a storage medium; a processor of a computer device reads the computer program from the storage medium and executes it, causing the computer device to perform the steps in the above-described video generation method.
The video generation method provided by the application can acquire a segment sequence of target music, wherein the segment sequence comprises a plurality of music segments; perform a sequence search on the segment sequence based on a dance action graph to obtain a dance action sequence composed of dance segments matched with each music segment; and further generate a plurality of transition sequences corresponding to the dance action sequence according to the key frames corresponding to each pair of adjacent dance segments in the dance action sequence. Each transition sequence is input into the action completion model for action completion, so as to obtain the dance video corresponding to the target music. In this way, the action completion model completes the transition sequence of each pair of adjacent dance segments in the dance action sequence, so that the actions between mutually adjacent key frames in each transition sequence are filled in, the action connection between dance segments in the dance action sequence is more accurate and natural with higher fluency, and the quality of dance motion synthesis is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a schematic diagram of a system architecture according to an embodiment of the present application.
Fig. 2 shows a flowchart of a video generating method according to an embodiment of the present application.
Fig. 3 is a schematic diagram of a dance action diagram according to an embodiment of the present application.
Fig. 4 shows a network architecture diagram of a cadence prediction model according to an embodiment of the present application.
Fig. 5 shows a schematic diagram of finding an optimal path according to an embodiment of the present application.
FIG. 6 is a schematic diagram of a key frame of a dance segment according to an embodiment of the present application.
Fig. 7 shows a network architecture diagram of an action completion network according to an embodiment of the present application.
Fig. 8 is a flowchart illustrating another video generating method according to an embodiment of the present application.
Fig. 9 shows an application scenario diagram of a video generating method according to an embodiment of the present application.
Fig. 10 shows a schematic diagram of generating a rhythm symbol according to an embodiment of the present application.
Fig. 11 is a block diagram of a video generating apparatus according to an embodiment of the present application.
Fig. 12 is a block diagram of a computer device according to an embodiment of the present application.
Fig. 13 is a block diagram of a computer readable storage medium according to an embodiment of the present application.
Detailed Description
Embodiments of the present application are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are exemplary only for explaining the present application and are not to be construed as limiting the present application.
In some of the processes described in the specification, claims and drawings above, a number of steps occurring in a particular order are included, but it should be understood that the steps may be performed out of order or performed in parallel, the sequence numbers of the steps merely being used to distinguish between the various steps, the sequence numbers themselves not representing any order of execution. Furthermore, the descriptions of "first" and "second" and the like herein are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
In order to enable those skilled in the art to better understand the solution of the present application, the following description will make clear and complete descriptions of the technical solution of the present application in the embodiments of the present application with reference to the accompanying drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that, in the specific embodiment of the present application, related data such as music clips, dance clips, and training sets are required to obtain user permission or consent when applied to specific products or technologies of the embodiments of the present application, and the collection, use, and processing of related data are required to comply with related laws and regulations and standards of related countries and regions.
Dance motion synthesis generates, based on an input music sequence, a dance motion sequence matching the style and rhythm of the music sequence. In the prior art, action-graph-based schemes search the action graph for dance segments matching each music segment in the music sequence and directly splice the adjacent dance segments together as the dance motion sequence. However, due to the differences between adjacent dance motions, directly spliced motions may exhibit large abrupt changes, foot sliding and similar phenomena.
In deep-learning-based dance motion synthesis methods, because training data is limited, the accuracy of the dance motions obtained through end-to-end learning of a neural network is low, the controllability of the generated dance motions is poor, and unnatural twisting motions may appear. In summary, current dance motion synthesis suffers from poor synthesis quality. To solve the above problems, the inventors, after research, propose the video generation method provided by the embodiments of the present application.
The architecture of the system of the video generation method according to the present application will be described first.
As shown in fig. 1, the video generating method provided by the embodiment of the present application may be applied to a system 300, where a data obtaining device 310 is configured to obtain training data. For the video generating method according to the embodiment of the present application, the training data may include a sample dance segment, a sample music segment, and a sample motion segment for model training, and a tag motion sequence, a first tag rhythm symbol, and a second tag rhythm symbol for training, where the tag motion sequence, the first tag rhythm symbol, and the second tag rhythm symbol for training may be manually pre-calculated. After the data acquisition device 310 acquires the training data, the training data may be stored in the database 320, and the training device 330 may train to obtain the target model 301 based on the training data maintained in the database 320.
Specifically, the training device 330 may train a preset neural network based on the input training data until the preset neural network meets the preset condition, to obtain the trained target model 301. The preset conditions may be: the total loss value of the target loss function is smaller than a preset value, the total loss value of the target loss function is not changed any more, or the training times reach the preset times, and the like. The object model 301 can be used to implement the video generation method in an embodiment of the present application.
The target model 301 in the embodiment of the present application may be a deep neural network model, for example, a convolutional neural network. In an actual application scenario, the training data maintained in the database 320 is not necessarily all from the data acquisition device 310, but may be received from other devices, for example, the client device 360 may also be used as a data acquisition end, and the acquired data may be used as new training data and stored in the database 320. In addition, the training device 330 may not be configured to train the preset neural network based on the training data maintained by the database 320, and may train the preset neural network based on the training data obtained from the cloud or other devices.
The target model 301 obtained by training with the training device 330 may be applied to different systems or devices, such as the execution device 340 shown in Fig. 1. The execution device 340 may be a terminal, for example a mobile phone, a tablet computer, a notebook computer, or an augmented reality (AR)/virtual reality (VR) device, or may be a server or a cloud, but is not limited thereto.
In fig. 1, the execution device 340 may be used for data interaction with an external device, for example, a user may send input data to the execution device 340 over a network using the client device 360. The input data may include, in an embodiment of the present application: target music sent by client device 360. In preprocessing the input data by the execution device 340, or in performing processing related to computation or the like by the execution module 341 of the execution device 340, the execution device 340 may call data, programs or the like in the data storage system 350 for corresponding computation processing, and store data and instructions such as processing results obtained by the computation processing in the data storage system 350.
Finally, the execution device 340 may return the processing result, that is, the dance video generated based on the object model 301, to the client device 360 through the network, so that the user may query the processing result on the client device 360. It should be noted that the training device 330 may generate, based on different training data, a corresponding target model 301 for different targets or different tasks, and the corresponding target model 301 may be used to achieve the targets or to perform the tasks, thereby providing the user with the desired result.
Illustratively, the system 300 shown in FIG. 1 may be a Client-Server (C/S) system architecture, the execution device 340 may be a cloud Server deployed for a service provider, and the Client device 360 may be a notebook computer used by a user. For example, a user may upload, by using dance motion synthesis software in a notebook computer, target music to be dance motion synthesized to a cloud server through a network, and when the cloud server receives the target music, generate a dance motion sequence, generate a plurality of transition sequences corresponding to the dance motion sequence, perform motion complement on the plurality of transition sequences by using a target model 301 to generate a dance video, and return the dance video to the notebook computer, so that the user may obtain the dance video on the dance motion synthesis software.
It should be noted that fig. 1 is only a schematic diagram of a system provided by an embodiment of the present application, and the architecture and application scenario of the system described in the embodiment of the present application are intended to describe the technical solution of the embodiment more clearly and do not constitute a limitation on the technical solution provided by the embodiment of the present application. For example, the data storage system 350 in Fig. 1 is external memory of the execution device 340; in other cases, the data storage system 350 may be disposed in the execution device 340. The execution device 340 may also directly be a client device. As those skilled in the art will appreciate, with the evolution of the system architecture and the appearance of new application scenarios, the technical solution provided by the embodiment of the present application is also applicable to solving similar technical problems.
Referring to fig. 2, fig. 2 is a flow chart illustrating a video generating method according to an embodiment of the application. In a specific embodiment, the video generating method is applied to the video generating apparatus 500 shown in fig. 11 and the computer device 600 (fig. 12) configured with the video generating apparatus 500.
In the following, a specific flow of the present embodiment will be described by taking a computer device as an example, and it is to be understood that the computer device applied in the present embodiment may be a server or a terminal, and the server may be an independent physical server, or may be a server cluster or a distributed system formed by multiple physical servers, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDNs, blockchains, and basic cloud computing services such as big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a smart speaker, a smart watch, etc. The video generation method specifically comprises the following steps:
step S110: a sequence of pieces of target music is acquired.
In the embodiment of the application, the consistency of the dance motion style and the music style is the basic requirement of dance motion synthesis, so that the basic requirement can be met by matching dance fragments with consistent styles for different music fragments in a piece of music. Where target music refers to a music signal, e.g., a song, for which dance video needs to be generated. By time-segmenting the music signal, a plurality of pieces of music can be obtained, and further, the plurality of pieces of music constitute a piece sequence of the target music.
In one embodiment, when the target music is acquired, the target music may be segmented according to a fixed duration to obtain a plurality of pieces of music with the same time length, and further, a sequence of pieces of music with the same time length is formed by the plurality of pieces of music with the same time length. For example, given a song with a duration of 60 seconds, the song may be segmented according to a duration of 2 seconds, resulting in 30 pieces of music with a duration of 2 seconds, and further, the 30 pieces of music form a corresponding piece sequence of the song.
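For illustration only, a minimal Python sketch of this fixed-duration segmentation might look as follows; the function name, the sampling rate and the choice to drop an incomplete tail segment are assumptions for the example, not details from the patent:

import numpy as np

def segment_music(signal: np.ndarray, sample_rate: int, segment_seconds: float = 2.0):
    """Split a mono audio signal into equal-length music segments."""
    segment_len = int(sample_rate * segment_seconds)
    num_segments = len(signal) // segment_len  # an incomplete tail segment is dropped
    return [signal[i * segment_len:(i + 1) * segment_len] for i in range(num_segments)]

# Example: a 60-second song sampled at 22050 Hz yields 30 segments of 2 seconds each.
song = np.zeros(60 * 22050)
pieces = segment_music(song, sample_rate=22050)
assert len(pieces) == 30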
Step S120: and performing sequence search on the segment sequences based on the dance action graph to obtain a dance action sequence consisting of dance segments matched with each music segment.
Here, the dance action graph refers to a directed graph (Directed Graph) composed of a plurality of dance segments, and a dance segment refers to a video segment having the same duration as a music segment. Optionally, the dance segments may be obtained from a dance motion library, in which dance motion segments of different dance styles and rhythms are stored.
Referring to Fig. 3, Fig. 3 is a schematic diagram of a dance action graph. As shown in Fig. 3, the dance action graph (Motion Graph) is a directed graph composed of a plurality of vertices (Vertices) and edges (Edges). Each vertex in the dance action graph represents a dance segment; for example, each dance segment has a duration of 2 seconds and contains 60 frames at a frame rate of 30 FPS (frames per second).
Each edge in the dance action graph represents the action transition cost (Transition Cost) between two adjacent vertices, denoted C_T. The action transition cost characterises the fluency of the transition from the dance segment corresponding to one vertex to the dance segment corresponding to the other vertex. Optionally, the action transition cost may be calculated from the Euclidean distance (L2 distance) between the transition frames of the dance segments corresponding to the two adjacent vertices.
Optionally, a transition frame is the average of a fixed number of video frames selected from a dance segment. For example, for dance segments D_p and D_q in the dance action graph, the average of the last 5 video frames of D_p may be used as its transition frame T_p, and the average of the first 5 video frames of D_q may be used as its transition frame T_q.
Optionally, the Euclidean distance may at least include the L2 distance between the human joint positions of the transition frames, the L2 distance between their rotation angles, and the L2 distance between their motion velocities. Thus, the action transition cost C_T(D_p, D_q) between dance segment D_p and dance segment D_q can be expressed as:

C_T(D_p, D_q) = ||pos(T_p) - pos(T_q)||_2 + ||rot(T_p) - rot(T_q)||_2 + ||vel(T_p) - vel(T_q)||_2
in particular, when constructing dance action diagrams based on dance action libraries, if the action transition cost between two dance segments is less than a certain threshold delta T The two vertexes corresponding to the two dance segments are considered to be connectable to form an edge of the dance action drawing; otherwise, it is not connectable. Wherein the direction of the edge can be determined by the rhythm of the dance movements of the two dance segments, a threshold delta T May be determined from empirical values for a particular experimental procedure. Thus, the fluency of the two dance segments is calculated through the action transition cost to construct the dance action diagram.
It should be noted that, in an action-graph-based framework, each synthesized motion corresponds to a path in the action graph. Therefore, the present application treats synthesizing dance motions for the input target music as finding, in the dance action graph, the optimal path that satisfies various choreography rules (e.g., consistent style, dance matching the music tempo, smooth dance motions, etc.).
Accordingly, the objective of finding the optimal path is to assign to each music segment of the target music a dance segment in the dance action graph, such that the dance segment matches the music segment and the action transition between adjacent dance segments (i.e., vertices) is natural.
To this end, the present application uses a music matching function to represent the cost of style and rhythm matching between a music segment and a dance segment, and an action matching function to represent the cost of a smooth transition between adjacent motion segments in the synthesized dance motion, thereby quantifying the optimization objective of the optimal-path search and facilitating the calculation of a matching dance segment for each music segment of the target music.
In some embodiments, the step of performing a sequence search on the segment sequence based on the dance action graph to obtain a dance action sequence composed of dance segments matched with each music segment may include:
(1) In the dance action graph, a sequence search is performed on each music segment in the segment sequence in turn according to a search function, so as to obtain the dance segment matched with each music segment.
The search function includes at least one of a music matching function and an action matching function; the music matching function is determined by a rhythm prediction model, and the action matching function is determined by the Euclidean distance between the two transition frames corresponding to two adjacent dance segments.
In order to better capture the rhythm relation between music and dance and facilitate matching dance segments to music segments, each beat is expressed as a binary vector, called a rhythm signature. The rhythm signature is an 8-dimensional binary vector consisting of 0s and 1s, where 0 represents an unaccented (weak) beat and 1 represents an accented beat (downbeat).
Since consecutive zeros in the rhythm signature represent continuous or smooth periods in a music segment or dance segment, the Hamming distance can be used to calculate the difference between the rhythm signatures corresponding to a music segment and a dance segment, and the music matching function is determined based on this difference. For this purpose, the application uses a rhythm prediction model to determine the rhythm signatures corresponding to music segments and dance segments.
Referring to Fig. 4, Fig. 4 shows a network architecture diagram of the rhythm prediction model. As shown in Fig. 4, the rhythm prediction model comprises an action rhythm prediction module F_d and a music rhythm prediction module F_m. The classification layer of the action rhythm prediction module and the classification layer of the music rhythm prediction module share weight parameters. Specifically, the action rhythm prediction module and the music rhythm prediction module each extract features from the input dance segment or music segment using 2 convolution layers and 1 fully connected layer, and then feed the features into 3 weight-shared fully connected (classification) layers for classification, so as to obtain the rhythm signatures corresponding to the music segment and the dance segment.
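The following PyTorch-style sketch is one possible reading of this architecture (two convolution layers and one fully connected layer per branch, followed by three weight-shared fully connected classification layers predicting the 8-dimensional rhythm signature); all layer sizes and hyper-parameters are assumptions, since the patent does not specify them:

import torch
import torch.nn as nn

class RhythmBranch(nn.Module):
    """Feature extractor: 2 convolution layers + 1 fully connected layer."""
    def __init__(self, in_channels: int, feat_dim: int = 128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(in_channels, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, x):                       # x: (batch, channels, time)
        return self.fc(self.conv(x).squeeze(-1))

class RhythmPredictor(nn.Module):
    def __init__(self, music_channels: int, motion_channels: int, feat_dim: int = 128):
        super().__init__()
        self.music_branch = RhythmBranch(music_channels, feat_dim)    # F_m
        self.motion_branch = RhythmBranch(motion_channels, feat_dim)  # F_d
        # Shared classifier: 3 fully connected layers predicting the 8-dim rhythm signature.
        self.classifier = nn.Sequential(
            nn.Linear(feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 32), nn.ReLU(),
            nn.Linear(32, 8),
        )

    def forward(self, music, motion):
        r_music = torch.sigmoid(self.classifier(self.music_branch(music)))
        r_motion = torch.sigmoid(self.classifier(self.motion_branch(motion)))
        return r_music, r_motion  # threshold at 0.5 to obtain the 0/1 rhythm marks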
In one embodiment, in the dance action graph, the step of sequentially performing a sequence search on each music segment in the segment sequence according to the search function to obtain the dance segment matched with each music segment may include:
(1.1) In the dance action graph, a sequence search is performed on the initial music segment in the segment sequence, and the dance segment having the smallest music matching function value with the initial music segment is determined as the initial dance segment.
(1.2) The initial music segment is taken as the target music segment, and the initial dance segment is taken as the target dance segment.
Referring to Fig. 5, Fig. 5 shows a schematic diagram of searching for the optimal path. As shown in Fig. 5, the segment sequence of the target music consists of 4 music segments. The adjacent music segment of the initial music segment M_1 is its next music segment M_2. The dance action graph contains n dance segments, and the adjacent dance segment set of the initial dance segment contains 5 adjacent dance segments.
For example, the music matching function value C_R(M_1, D_i) between each dance segment D_i (0 < i ≤ n, i ∈ N*) in the dance action graph and the initial music segment M_1 may be calculated as follows:

C_R(M_1, D_i) = HammingDistance(r_m(M_1), r_d(D_i))

where r_m(M_1) is the rhythm signature of the initial music segment M_1 predicted by the music rhythm prediction module F_m, r_d(D_i) is the rhythm signature of the i-th dance segment D_i predicted by the action rhythm prediction module F_d, and the Hamming distance represents the difference between the rhythm signatures.
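A minimal sketch of this music matching cost, assuming the rhythm signatures are available as 8-dimensional 0/1 vectors, could be:

import numpy as np

def music_match_cost(rhythm_music: np.ndarray, rhythm_dance: np.ndarray) -> int:
    """C_R(M, D) = Hamming distance between the two binary rhythm signatures."""
    assert rhythm_music.shape == rhythm_dance.shape == (8,)
    return int(np.sum(rhythm_music != rhythm_dance))

# Example: two signatures differing in two beat positions have a cost of 2.
print(music_match_cost(np.array([1, 0, 0, 1, 0, 0, 1, 0]),
                       np.array([1, 0, 1, 1, 0, 0, 0, 0])))  # 2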
Further, the dance segment in the dance action graph having the smallest music matching function value C_R(M_1, D_i) with the initial music segment M_1 is determined as the initial dance segment D_1. The initial music segment M_1 is taken as the target music segment, and the initial dance segment D_1 is taken as the target dance segment.
(1.3) Acquiring the adjacent music segment of the target music segment in the segment sequence, and acquiring the adjacent dance segment set of the target dance segment in the dance action graph.
Illustratively, the adjacent music segment M_2 of the target music segment in the segment sequence is acquired, and the adjacent dance segment set of the target dance segment in the dance action graph is acquired.
(1.4) In the adjacent dance segment set, a sequence search is performed on the adjacent music segment, and the adjacent dance segment for which the sum of its music matching function value with the adjacent music segment and its action matching function value with the target dance segment is smallest is determined as the matching dance segment.
Illustratively, each adjacent dance segment D_k in the adjacent dance segment set is acquired. Further, based on the music rhythm prediction module F_m and the action rhythm prediction module F_d of the rhythm prediction model, the music matching function value C_R(M_2, D_k) between each adjacent dance segment and the adjacent music segment is obtained.
Further, based on the Euclidean distance between the corresponding transition frames, the action matching function value C_M(D_1, D_k) between each adjacent dance segment D_k and the dance segment D_1 matched with the target music segment is obtained. Its calculation is the same as that of the action transition cost:

C_M(D_1, D_k) = ||pos(T_1) - pos(T_k)||_2 + ||rot(T_1) - rot(T_k)||_2 + ||vel(T_1) - vel(T_k)||_2

where T_1 is the transition frame of the dance segment D_1 matched with the target music segment, and T_k is the transition frame of the k-th adjacent dance segment D_k in the adjacent dance segment set.
Further, the adjacent dance segment in the adjacent dance segment set for which the sum of the music matching function value and the action matching function value with the target dance segment is smallest is determined as the matching dance segment, namely D_2. Specifically, the sum of the music matching function value C_R(M_2, D_k) and the action matching function value C_M(D_1, D_k) is calculated as:

α · C_R(M_2, D_k) + β · C_M(D_1, D_k)

where α and β are weight parameters, which can be determined from empirical values obtained in specific experiments.
And (1.5) updating the matched dancing segments into target dancing segments, updating the adjacent music segments into target music segments, and iteratively returning to the step of acquiring the adjacent music segments of the target music segments in the segment sequence until each music segment in the segment sequence finishes the sequence search, thereby obtaining the dancing segments matched with each music segment.
Illustratively, following the above process of searching for matching dance segments, the dance segment D_3 matched with music segment M_3 and the dance segment D_4 matched with music segment M_4 can be found in turn in the dance action graph, thereby obtaining the dance segment matched with each music segment.
(2) And obtaining a dance action sequence corresponding to the target music based on each dance segment.
Illustratively, a dance action sequence {D_1, D_2, D_3, D_4} corresponding to the target music can be obtained based on the dance segments.
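Putting the search together, a hypothetical greedy implementation of the sequence search described in steps (1.1)-(1.5) might look as follows; the cost functions, the weights α and β, and the neighbour structure are placeholders for illustration, not the patent's code:

def search_dance_sequence(music_segments, dance_segments, neighbours,
                          music_cost, action_cost, alpha=1.0, beta=1.0):
    """music_cost(m, d) plays the role of C_R, action_cost(d_prev, d) the role of C_M,
    and neighbours[d] is the adjacent dance-segment set of vertex d in the graph."""
    # Initial dance segment: smallest music matching cost with the first music segment.
    current = min(range(len(dance_segments)),
                  key=lambda d: music_cost(music_segments[0], d))
    sequence = [current]
    # Each subsequent segment minimises alpha * C_R + beta * C_M among the neighbours.
    for m in music_segments[1:]:
        current = min(neighbours[current],
                      key=lambda d: alpha * music_cost(m, d)
                                    + beta * action_cost(sequence[-1], d))
        sequence.append(current)
    return sequence  # e.g. [D_1, D_2, D_3, D_4] for a four-segment piece of music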
Step S130: and generating a plurality of transition sequences corresponding to the dance motion sequences according to the key frames corresponding to each pair of adjacent dance segments in the dance motion sequences.
In the dance action sequence generated based on the dance action graph, although each pair of adjacent dance segments has a certain degree of fluency and the rhythm and style of each dance segment match the corresponding music segment well, the dance segments are directly spliced together, so motion discontinuities easily occur: the dance motions in the synthesized dance video may change abruptly, with artifacts such as foot sliding that human eyes can perceive intuitively.
It is easy to appreciate that a complete dance video is far more than simply stitching a batch of dance segments together smoothly, even if every segment is aesthetically pleasing by itself; from a professional dance and artistic point of view, such a result may still be judged to have a strong sense of pieced-together motion and a poor fit between music and motion. Therefore, the application creatively provides an action completion method that completes the motions of the transition sequences generated from the dance action sequence, so that the motions between adjacent key frames in the completed dance action sequence connect naturally, effectively improving the quality of dance video synthesis.
In one embodiment, the step of generating a plurality of transition sequences corresponding to the dance motion sequence according to the key frames corresponding to each pair of adjacent dance segments in the dance motion sequence may include:
(1) And acquiring a first key frame of the first dance segment from the dance action sequence, and taking the first key frame as a first frame.
(2) And obtaining a second key frame of the second dance segment from the dance action sequence, and taking the second key frame as an end frame.
Here, the first dance segment is the preceding dance segment adjacent to the second dance segment. The positions of the first key frame and the second key frame in their respective dance segments are symmetrical in the dance action sequence. Referring to Fig. 6, Fig. 6 is a schematic diagram of the key frames of dance segments.
As shown in Fig. 6, the first key frame is the 5th-from-last frame of the first dance segment, the second key frame is the 5th frame of the second dance segment, and the positions of the first key frame and the second key frame in the dance action sequence are symmetrical. Optionally, the positions of the first and second key frames may be determined based on empirical values from specific experiments.
For example, a first key frame of a first dance segment may be obtained from a dance motion sequence with the first key frame as a first frame and a second key frame of a second dance segment may be obtained from the dance motion sequence with the second key frame as a last frame.
(3) And performing linear interpolation calculation based on the first frame and the last frame to obtain a plurality of intermediate frames.
Here, the linear interpolation is a linear computation based on the first frame and the last frame. For example, given a linear parameter γ, an intermediate frame I_middle is calculated by linear interpolation from the first frame I_first and the last frame I_last as:

I_middle = γ · I_first + (1 - γ) · I_last
Illustratively, different intermediate frames I_middle can be obtained by adjusting the linear parameter γ. The number of intermediate frames generated is determined by the dimension ultimately required for the transition sequence. The dimension of the transition sequence may be determined based on empirical values from specific experiments; in the embodiment of the present application, the dimension of the transition sequence may be 10, so the number of generated intermediate frames is 8.
(4) And determining a transition sequence corresponding to the first dance segment and the second dance segment according to the first frame, the last frame and each intermediate frame so as to obtain a plurality of transition sequences corresponding to the dance motion sequence.
For example, a corresponding first frame, a corresponding last frame, and each intermediate frame may be determined for every two adjacent dance segments in the dance motion sequence, and a transition sequence may be formed by the corresponding first frame, last frame, and each intermediate frame, so that a plurality of transition sequences corresponding to the dance motion sequence may be obtained.
For example, with the dimension of the transition sequence set to 10, for dance segment D_1 and dance segment D_2 of the dance action sequence, a transition sequence P = {I_0, I_1, ..., I_8, I_9} corresponding to the first dance segment and the second dance segment can be determined based on the first frame I_0, the last frame I_9 and the intermediate frames (I_1, ..., I_8).
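A small sketch of this transition-sequence construction, under the assumption that each frame is stored as a pose vector and that γ is stepped linearly from 1 to 0, could be:

import numpy as np

def make_transition_sequence(first_frame: np.ndarray, last_frame: np.ndarray,
                             length: int = 10) -> np.ndarray:
    """I_i = gamma_i * I_first + (1 - gamma_i) * I_last, with gamma decreasing
    from 1 (first frame) to 0 (last frame)."""
    gammas = np.linspace(1.0, 0.0, length)
    return np.stack([g * first_frame + (1.0 - g) * last_frame for g in gammas])

# Row 0 equals the first key frame, row 9 the last key frame, and rows 1..8 are
# the interpolated intermediate frames fed to the action completion model.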
Step S140: and inputting each transition sequence into the action completion model to perform action completion, and obtaining the dance video corresponding to the target music.
A dance video is composed of a plurality of dance segments arranged in a certain order, and each dance segment comprises a plurality of video frames. It can be understood that a dance video is essentially a sequence of video frames carrying dance attributes; however, because dance motions have professional attributes such as rhythm and style, motion discontinuities easily occur between dance segments.
Considering that dance video is sequential data (Sequential Data), the particular structure of the Transformer makes it easier to capture long-range interdependent features in a sequence: the relation between any two sequence elements is represented directly in one computation step, which greatly shortens the distance between long-range interdependent features and allows them to be exploited effectively.
Therefore, the application performs action completion on the transition sequences corresponding to the dance action sequence with a Transformer-based action completion model, so that the action connection between mutually adjacent key frames in each transition sequence is more accurate and natural, which in turn reduces the sense of piecing in the overall dance motion of the dance video and improves its fit. The action completion model is obtained based on action distance loss training, and the action distance loss is determined by the Euclidean distance between the predicted action sequence and the label action sequence.
As an implementation manner, each transition sequence is input into the action completion model for action completion. Specifically, the transition sequence may be input into the action completion model, an embedding operation is performed on the transition sequence to obtain an embedding vector, feature extraction is then performed on the embedding vector to obtain the corresponding feature vector, and a convolution computation is performed on the feature vector to obtain the completion sequence. Further, the key frames at the corresponding positions of the transition sequence in the dance action sequence are replaced with the key frames in the completion sequence, giving the completed dance action sequence, i.e., the dance video corresponding to the target music.
Referring to Fig. 7, Fig. 7 shows a network architecture diagram of the action completion network. As shown in Fig. 7, the action completion network may include three representation layers, a feature extraction layer and an output layer, where the three representation layers may respectively be a convolution layer (Convolution Layer), a positional embedding layer (Positional Embedding Layer) and a key frame embedding layer (Keyframe Embedding Layer). The feature extraction layer may be a Transformer encoder (Transformer Encoder), and the output layer may be a convolution layer.
When the transition sequence P = {I_0, I_1, ..., I_8, I_9} is obtained, the transition sequence P is input into the convolution layer to obtain the segment representation (Token Embedding), into the positional embedding layer to obtain the positional representation (Positional Embedding), and into the key frame embedding layer to obtain the key frame representation (Keyframe Embedding).
Further, the segment representation, the positional representation and the key frame representation are added to obtain the embedding vector; the embedding vector is input into the Transformer encoder for feature extraction to obtain the corresponding feature vector; and the feature vector is input into the output layer (convolution layer) for convolution computation to obtain the completion sequence corresponding to the transition sequence. Action completion is performed on each transition sequence by the action completion model to obtain the completion sequence corresponding to each transition sequence in the dance action sequence, and the dance video corresponding to the target music is thus obtained.
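The following PyTorch-style sketch illustrates one way such a completion network could be assembled from the three representation layers, a Transformer encoder and a convolutional output layer; the model dimension, the number of layers and heads, and the use of a binary key-frame mask are assumptions for the example, not values taken from the patent:

import torch
import torch.nn as nn

class ActionCompletionNet(nn.Module):
    def __init__(self, pose_dim: int, seq_len: int = 10, d_model: int = 256):
        super().__init__()
        self.token_embed = nn.Conv1d(pose_dim, d_model, kernel_size=1)  # segment representation
        self.pos_embed = nn.Embedding(seq_len, d_model)                 # positional representation
        self.key_embed = nn.Embedding(2, d_model)                       # key frame vs. interpolated frame
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)
        self.out_conv = nn.Conv1d(d_model, pose_dim, kernel_size=1)     # output layer

    def forward(self, transition, key_mask):
        # transition: (batch, seq_len, pose_dim); key_mask: long tensor (batch, seq_len),
        # 1 for key frames and 0 for interpolated frames.
        tok = self.token_embed(transition.transpose(1, 2)).transpose(1, 2)
        pos = self.pos_embed(torch.arange(transition.size(1), device=transition.device))
        key = self.key_embed(key_mask)
        feats = self.encoder(tok + pos + key)    # embedding vector -> feature vector
        return self.out_conv(feats.transpose(1, 2)).transpose(1, 2)  # completion sequence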
According to the embodiment of the application, a segment sequence of the target music can be obtained, wherein the segment sequence comprises a plurality of music segments; the segment sequence is searched based on the dance action graph to obtain a dance action sequence composed of dance segments matched with each music segment; and a plurality of transition sequences corresponding to the dance action sequence are generated according to the key frames corresponding to each pair of adjacent dance segments in the dance action sequence. Each transition sequence is input into the action completion model for action completion, so as to obtain the dance video corresponding to the target music. Because the transition sequence of each pair of adjacent dance segments is completed by the action completion model, the motions between mutually adjacent key frames in each transition sequence are filled in, the action connection between dance segments in the dance action sequence becomes more accurate, natural and smooth, the sense of pieced-together motion in the overall dance of the video is reduced, the fit between dance motion and music is improved, and the generation quality of the dance video is further improved.
The methods described in connection with the above embodiments are described in further detail below by way of example.
The video generation method of the present application relates to artificial intelligence (Artificial Intelligence, AI) technology, which is a theory, method, technique and application system that simulates, extends and expands human intelligence, senses environment, acquires knowledge and uses knowledge to obtain optimal results using a digital computer or a machine controlled by a digital computer. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Computer vision (CV) technology, as a branch of artificial intelligence, is the science of studying how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as recognition, detection and measurement on a target, and further performs graphics processing so that the result is an image more suitable for human observation or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies, attempting to build artificial intelligence systems that can acquire information from images or multidimensional data.
Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, OCR, video generation, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality (e.g., virtual persons), augmented reality, synchronous positioning, and map construction, among others, as well as common biometric recognition techniques such as face recognition, fingerprint recognition, and others.
The video generating method provided in this embodiment relates to technologies such as computer vision of artificial intelligence, and will be described below by taking a specific integration of a video generating apparatus in a computer device as an example, and details will be described with respect to a flowchart shown in fig. 8 in conjunction with an application scenario shown in fig. 9, where the computer device may be a server or a terminal device. Referring to fig. 8, fig. 8 illustrates another video generating method according to an embodiment of the present application, and in a specific embodiment, the video generating method may be applied to the dance creative scene shown in fig. 9.
The dance creative service provider offers a service that includes a cloud training server 410 and a cloud execution server 430. The cloud training server 410 may be used to train the action completion model and the rhythm prediction model for dance motion synthesis, while the cloud execution server 430 deploys the action completion model and rhythm prediction model obtained by training on the cloud training server 410, and synthesizes dance motions for the target music sent by the client through these models to obtain the dance video. When a user, as a client, uses the dance creative service, the dance creative service software 421 is opened on the computer 420.
It should be noted that fig. 9 is only one application scenario provided by the embodiment of the present application, and the application scenario described in the embodiment of the present application is for more clearly describing the technical solution of the embodiment of the present application, and does not constitute a limitation to the technical solution provided by the embodiment of the present application. For example, in other cases, computer 420 may deploy the action completion model and the cadence prediction model trained on cloud training server 410, so that dance action synthesis may be accomplished directly on computer 420. Those skilled in the art can know that with the evolution of the system architecture and the appearance of new application scenarios (such as game items and live broadcast, etc.), the technical solution provided by the embodiment of the present application is also suitable for solving similar technical problems. The video generation method specifically comprises the following steps:
Step 210: the computer device obtains a sample dance segment set.
According to the application, the action complement model is used for carrying out action complement on a plurality of transition sequences corresponding to the dance action sequences, so that action connection between mutually adjacent key frames in the transition sequences is more accurate and natural, thereby reducing the splicing sense of the integral dance actions in the dance video and improving the fitness of the integral dance actions.
The motion completion model is obtained by training on a sample dance segment set based on motion distance loss, and the motion distance loss is determined by Euclidean distance of a predicted motion sequence and a label motion sequence. The sample dance segment set comprises a plurality of pairs of sample dance segments, and each pair of sample dance segments comprises a first sample segment and a second sample segment.
For example, the sample dance segment set may be manually pre-recorded, and similarly, the cloud training server 410 may obtain the sample dance segment set from the database 320 storing training data, or directly obtain the sample dance segment set from other devices, such as a user device, for example.
It should be noted that, in the video generating method provided in the embodiment of the present application, training of the preset action completing network may be performed in advance according to the obtained training sample data set, and then, when dance action synthesis (dance video generation) needs to be performed each time, the action completing model obtained by training may be directly used for calculation, without repeating the network training each time dance action synthesis is performed.
Step 220: the computer equipment carries out linear interpolation calculation based on the first sample segment and the second sample segment, and determines a sample action sequence corresponding to each pair of sample dance segments.
In the embodiment of the application, action completion can take any two given dance segments: the key frames corresponding to the two dance segments are input into the action completion module, and the action completion module then outputs the actions of the other frames between the two key frames, so that the completed actions achieve a natural and coherent effect.
As one embodiment, the computer device performs linear interpolation calculation based on the first sample segment and the second sample segment, and the step of determining a sample action sequence corresponding to each pair of sample dance segments may include:
(1) The computer device obtains a first sample key frame of the first sample fragment and takes the first sample key frame as a sample header frame.
(2) The computer device obtains a second sample key frame of the second sample fragment and takes the second sample key frame as a sample end frame.
Here, the positions of the first sample key frame and the second sample key frame in their respective sample segments are symmetrical. For example, the first sample key frame is the 5th-from-last frame of the first sample segment and, correspondingly, the second sample key frame is the 5th frame of the second sample segment.
Illustratively, the cloud training server 410 needs to generate a vector of dimension 10 as input to the action completion network, i.e., the sample action sequence P' = {I_0, I_1, ..., I_8, I_9}. The cloud training server 410 may obtain the 5th-from-last key frame of the first sample segment, i.e., the first sample key frame, and take it as the sample first frame I_0. Further, the cloud training server 410 may obtain the 5th key frame of the second sample segment, i.e., the second sample key frame, and take it as the sample end frame I_9.
(3) The computer equipment performs linear interpolation calculation based on the first frame of the sample and the last frame of the sample to obtain a plurality of intermediate frames of the sample.
(4) And the computer equipment determines a sample action sequence corresponding to each pair of sample dance fragments according to the sample first frame, the sample last frame and each sample middle frame.
Illustratively, the cloud training server 410 may, based on the linear interpolation formula I_i = γ'·I_0 + (1 - γ')·I_9 (0 < i < 9, i ∈ N*), calculate 8 sample intermediate frames {I_1, ..., I_8}, and further obtain the sample action sequence P' = {I_0, I_1, ..., I_8, I_9}. Likewise, the cloud training server 410 may determine the sample action sequence corresponding to each pair of sample dance segments.
Step 230: the computer equipment inputs the sample action sequence into a preset action complement network to complete the action, and the action complement network outputs a predicted action sequence corresponding to the sample action sequence.
The preset action completion network may include three representation layers, a feature extraction layer and an output layer, where the three representation layers may respectively be a convolution layer, a positional embedding layer and a key frame embedding layer. The feature extraction layer may be a Transformer encoder, and the output layer may be a convolution layer.
As one embodiment, the step of inputting the sample action sequence into a preset action completion network for action completion by the computer device, and outputting the predicted action sequence corresponding to the sample action sequence by the action completion network may include:
(1) The computer equipment inputs the sample action sequence into a preset action complement network, and performs embedding operation on the sample action sequence to obtain an embedding vector corresponding to the sample action sequence.
Specifically, the computer device may input the sample motion sequence to a convolution layer of a preset motion completion network, calculate to obtain a segment representation, and input the sample motion sequence to a first embedding layer of the preset motion completion network, calculate to obtain a gesture representation. Furthermore, the computer equipment inputs the sample action sequence to a second embedding layer of a preset action complement network, calculates to obtain a key frame representation, and obtains an embedding vector corresponding to the sample action sequence based on the segment representation, the gesture representation and the key frame representation.
Illustratively, the cloud training server 410 may input the sample action sequence P′ = {I_0, I_1, …, I_8, I_9} to the convolution layer, the gesture embedding layer and the key frame embedding layer of the action completion network respectively, obtain the corresponding segment representation, gesture representation and key frame representation, and then add the three representations to obtain the embedding vector.
(2) And the computer equipment performs feature extraction on the embedded vector to obtain a feature vector corresponding to the embedded vector.
(3) The computer equipment carries out convolution calculation on the feature vector to obtain a predicted action sequence.
Illustratively, the cloud training server 410 may input the embedding vector to the Transformer encoder for feature extraction to obtain the feature vector corresponding to the embedding vector; further, the cloud training server 410 may input the feature vector to a convolution layer for convolution calculation and output the predicted action sequence corresponding to the sample action sequence P′ = {I_0, I_1, …, I_8, I_9}.
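The following PyTorch sketch illustrates one plausible layout of such a network. The layer widths, kernel sizes, and the choice of summing the three representations are assumptions for illustration only; they are not the exact architecture claimed in the application.

```python
import torch
import torch.nn as nn

class ActionCompletionNet(nn.Module):
    """Illustrative action completion network: three representation layers,
    a Transformer encoder for feature extraction, and a convolutional output layer."""

    def __init__(self, pose_dim=72, d_model=256, seq_len=10, n_layers=4, n_heads=8):
        super().__init__()
        # representation layers: convolution, gesture (pose) embedding, key-frame embedding
        self.segment_conv = nn.Conv1d(pose_dim, d_model, kernel_size=3, padding=1)
        self.pose_embed = nn.Linear(pose_dim, d_model)
        self.keyframe_embed = nn.Embedding(seq_len, d_model)
        # feature extraction layer
        encoder_layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # output layer
        self.out_conv = nn.Conv1d(d_model, pose_dim, kernel_size=3, padding=1)

    def forward(self, seq):                                   # seq: (batch, seq_len, pose_dim)
        seg_repr = self.segment_conv(seq.transpose(1, 2)).transpose(1, 2)
        pose_repr = self.pose_embed(seq)
        positions = torch.arange(seq.size(1), device=seq.device)
        key_repr = self.keyframe_embed(positions)             # (seq_len, d_model), broadcast over batch
        embed = seg_repr + pose_repr + key_repr                # embedding vector
        feat = self.encoder(embed)                             # feature vector
        return self.out_conv(feat.transpose(1, 2)).transpose(1, 2)  # predicted action sequence
```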
Step 240: the computer device determines an action distance loss of the action completion network based on the Euclidean distance of the predicted action sequence and the tag action sequence.
Illustratively, the cloud training server 410 may determine the action distance loss of the action completion network based on the predicted action sequence P̂ and the label action sequence P. The action distance loss L_dis is the Euclidean distance between the two sequences, and its calculation formula is as follows:

L_dis = ‖P̂ − P‖_2
step 250: the computer equipment carries out iterative training on the action completion network based on the action distance loss until the action completion network meets the preset condition, and an action completion model is obtained.
Illustratively, the cloud training server 410 may perform iterative training on the action completion network based on the action distance loss until the action completion network meets the preset condition, so as to obtain the action completion model. Further, the cloud training server 410 may send the action completion model to the cloud execution server 430, and the cloud execution server 430 may deploy the action completion model to perform the task of action completion.
It should be noted that the preset condition may be: the total value of the action distance loss is smaller than a preset value, the total value of the action distance loss no longer changes, the number of training iterations reaches a preset number, or the like. Optionally, an optimizer may be employed to optimize the action distance loss, with the learning rate, the batch size used in training, and the number of training epochs set based on experimental experience.
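A minimal training-loop sketch under the same assumptions (the ActionCompletionNet sketched above, a data loader yielding (sample sequence, label sequence) pairs, and the Adam optimizer with placeholder hyper-parameters):

```python
import torch

def train_completion_model(model, loader, epochs=100, lr=1e-4, device="cuda"):
    """Iteratively train the action completion network with the action distance loss."""
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for epoch in range(epochs):
        total = 0.0
        for sample_seq, label_seq in loader:                  # each: (batch, 10, pose_dim)
            sample_seq, label_seq = sample_seq.to(device), label_seq.to(device)
            pred_seq = model(sample_seq)
            # action distance loss: Euclidean distance between predicted and label sequences
            loss = torch.norm(pred_seq - label_seq, p=2, dim=-1).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch}: action distance loss {total / len(loader):.4f}")
    return model
```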
Step 260: the computer device obtains a sequence of pieces of target music.
For example, when the cloud execution server 430 acquires the target music, the target music may be segmented by a fixed duration to obtain a plurality of music segments of equal length, which together form the segment sequence. For example, given a song with a duration of 60 seconds, the cloud execution server 430 may segment the song with a 2-second window to obtain 30 music segments each lasting 2 seconds, and form the segment sequence of the song from these 30 music segments.
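A brief sketch of this segmentation step, assuming the audio is loaded with librosa and any trailing remainder shorter than one segment is dropped; the 2-second window follows the example above:

```python
import librosa

def split_music(path, segment_seconds=2.0):
    """Split the target music into fixed-duration segments (the segment sequence)."""
    audio, sr = librosa.load(path, sr=None, mono=True)
    hop = int(segment_seconds * sr)
    # drop a trailing remainder shorter than one full segment (an assumption for simplicity)
    return [audio[i:i + hop] for i in range(0, len(audio) - hop + 1, hop)]
```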
Step 270: the computer device performs a sequence search on the segment sequence based on the dance action map to obtain a dance action sequence composed of dance segments matched with each music segment.
In the embodiment of the application, synthesizing dance actions for the input target music is regarded as finding, in the dance action diagram, the optimal path that satisfies various choreography rules (consistent style, dance matching the music rhythm, and smooth dance actions). The optimization objective of this path search is quantified by the cost of a music matching function, which represents the style and rhythm match between a music segment and a dance segment, and the cost of an action matching function, which represents how smoothly adjacent action segments transition within the synthesized dance; the matching dance segment is then calculated for each music segment of the target music.
In some embodiments, the computer device may perform a sequence search on the segment sequence based on the dance action map to obtain a dance action sequence consisting of dance segments matched with each music segment, as follows:
(1) And the computer equipment sequentially searches each music segment in the segment sequence according to the search function in the dance action diagram to obtain dance segments matched with each music segment.
The search function comprises at least one of a music matching function or an action matching function, the music matching function is determined by a rhythm prediction model, and the action matching function is determined by Euclidean distances of two transition frames corresponding to two adjacent dance segments. Alternatively, the tempo prediction model may be trained by:
the cloud training server 410 may obtain a first training set of rhythms comprising a sample piece of music and a first tag rhythm symbol, and a second training set of rhythms comprising a sample action piece and a second tag rhythm symbol. The first tag cadence signature may be manually noted by a professional listening to a sample piece of music, and the first tag cadence signature may be an 8-dimensional vector manually noted. The vector consists of 0 or 1, where 0 represents a weak beat in the beat and 1 represents a re-beat in the beat. The second label rhythm symbol may be manually marked by a professional by viewing the sample action segment, and the second label rhythm symbol may be an 8-dimensional vector manually marked. The vector consists of 0 or 1, where 0 represents a weak beat in the beat and 1 represents a re-beat in the beat.
Further, the cloud training server 410 may obtain a preset tempo prediction network, where the tempo prediction network includes an action tempo prediction module and a music tempo prediction module, and the classification layers of the action tempo prediction module and the classification layers of the music tempo prediction module share weight parameters.
Further, the cloud training server 410 may perform iterative training on the tempo prediction network based on the sample music piece, the first tag tempo symbol, the sample action piece, and the second tag tempo symbol until the tempo prediction network meets a preset condition, and obtain a tempo prediction model.
Specifically, the cloud training server 410 may input the sample music piece to the music tempo prediction module to perform music tempo prediction, so as to obtain a predicted music tempo token. And inputting the sample action segment into an action rhythm prediction module to predict the action rhythm, so as to obtain a predicted action rhythm mark.
Referring to fig. 10, fig. 10 shows a schematic diagram of generating rhythm symbols. When the sample music piece is obtained, signal analysis can be carried out on it: the information energy and the music onset (start) signal corresponding to the sample music piece are obtained from its music information, and the vector representation composed of the information energy and the music onset signal is then input into the music rhythm prediction module.
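An illustrative sketch of this signal analysis, assuming librosa is used for the audio processing; the specific onset features and frame parameters below are placeholders rather than the exact features used in the application:

```python
import librosa
import numpy as np

def music_rhythm_features(segment, sr):
    """Build the vector representation fed to the music rhythm prediction module:
    an onset-strength (energy) envelope plus a binary onset (music start) signal."""
    onset_env = librosa.onset.onset_strength(y=segment, sr=sr)                # information energy
    onset_frames = librosa.onset.onset_detect(onset_envelope=onset_env, sr=sr)
    onset_signal = np.zeros_like(onset_env)
    onset_signal[onset_frames] = 1.0                                          # music onset signal
    return np.stack([onset_env, onset_signal])                               # shape (2, num_frames)
```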
When the sample action segment is obtained, the rotation angle of each joint of the moving object can be calculated for every frame of the segment; the motion curvature of the object's hand movement and a foot step flag (the height of the object's foot above the ground) are calculated according to these rotation angles, and the vector representation composed of the rotation angles, the hand motion curvature and the foot step flag is then input to the action rhythm prediction module.
The action rhythm prediction module and the music rhythm prediction module each extract features from the input vector representation using 2 convolution layers and 1 fully connected layer, and feed the extracted features into 3 weight-sharing fully connected layers (classification layers) for classification, obtaining the predicted music rhythm token R_m corresponding to the music piece and the predicted action rhythm token R_d corresponding to the dance segment.
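A PyTorch sketch of one possible realization of this two-branch network. The input feature dimensions (2 music channels, 3 motion channels), hidden width and pooling are assumptions for illustration; only the overall structure (2 convolution layers + 1 fully connected layer per branch, 3 weight-shared classification layers, 8 beat outputs) follows the description above.

```python
import torch
import torch.nn as nn

class RhythmPredictionNet(nn.Module):
    """Illustrative rhythm prediction network: separate feature extractors for music
    and motion, followed by classification layers whose weights are shared."""

    def __init__(self, music_dim=2, motion_dim=3, hidden=64, beats=8):
        super().__init__()
        def branch(in_dim):
            # 2 convolution layers + 1 fully connected layer, per the description above
            return nn.Sequential(
                nn.Conv1d(in_dim, hidden, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(hidden, hidden),
            )
        self.music_branch = branch(music_dim)
        self.motion_branch = branch(motion_dim)
        # 3 fully connected classification layers shared by both branches
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, beats),            # one logit per beat position
        )

    def forward(self, music_feat, motion_feat):  # each: (batch, channels, time)
        r_m = torch.sigmoid(self.classifier(self.music_branch(music_feat)))
        r_d = torch.sigmoid(self.classifier(self.motion_branch(motion_feat)))
        return r_m, r_d                          # predicted music / action rhythm tokens
```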
Further, the cloud training server 410 may determine a target loss corresponding to the rhythm prediction network based on the predicted music rhythm token, the predicted action rhythm token, the first tag rhythm token, and the second tag rhythm token. Specifically, the cloud training server 410 may determine the music rhythm loss L_m based on the Hamming distance between the predicted music rhythm token R_m and the first tag rhythm token Y_m, and determine the action rhythm loss L_d based on the Hamming distance between the predicted action rhythm token R_d and the second tag rhythm token Y_d. The calculation formulas may be as follows:

L_m = Hamming(R_m, Y_m),  L_d = Hamming(R_d, Y_d)
thus, the cloud training server 410 may determine the target loss L corresponding to the performance prediction network according to the music tempo loss and the action tempo loss rhythm The calculation formula may be as follows:
L rhythm= θL m +μL d
where θ and μ are weight parameters that can be determined from empirical values in a specific experimental process. Further, the cloud training server 410 may iteratively train the rhythm prediction network according to the target loss L_rhythm until the rhythm prediction network meets the preset condition.
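A sketch of this loss computation on the 8-dimensional 0/1 rhythm tokens. The hard threshold makes the Hamming distance non-differentiable, so an actual training run would substitute a differentiable surrogate such as binary cross-entropy; θ and μ below are the weight parameters mentioned above.

```python
import torch

def rhythm_target_loss(r_m, r_d, label_m, label_d, theta=0.5, mu=0.5):
    """Target loss L_rhythm = theta * L_m + mu * L_d, where L_m and L_d are the
    (normalised) Hamming distances between predicted and label rhythm tokens.

    r_m, r_d: predicted music / action rhythm probabilities, shape (batch, 8).
    label_m, label_d: 0/1 label rhythm tokens, shape (batch, 8).
    """
    l_m = ((r_m > 0.5).float() != label_m).float().mean()   # music rhythm loss L_m
    l_d = ((r_d > 0.5).float() != label_d).float().mean()   # action rhythm loss L_d
    return theta * l_m + mu * l_d
```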
The preset condition may be: the total value of the target loss is smaller than a preset value, the total value of the target loss no longer changes, the number of training iterations reaches a preset number, or the like. Optionally, an optimizer may be employed to optimize the target loss, with the learning rate, the batch size used in training, and the number of training epochs set based on experimental experience.
In one embodiment, in the dance action diagram, the step of sequentially searching each music segment in the segment sequence according to the search function to obtain a dance segment matched with each music segment may include:
(1.1) the computer device performs a sequence search for an initial musical piece in the sequence of pieces in the dance action map, and determines a dance piece having a smallest music matching function value with the initial musical piece as the initial dance piece.
(1.2) the computer device regarding the initial musical piece as a target musical piece, and regarding the initial dance piece as a target dance piece.
Illustratively, the cloud execution server 430 may, based on the music matching function, calculate the music matching function value C_R(M_1, D_i) between each dance segment D_i (0 < i < n, i ∈ N*) in the dance action diagram and the initial music segment M_1. The dance segment in the dance action diagram whose music matching function value C_R(M_1, D_i) with the initial music segment M_1 is smallest is determined as the initial dance segment D_1. The initial music segment M_1 is then taken as the target music segment, and the initial dance segment D_1 as the target dance segment.
(1.3) the computer device obtaining a contiguous musical piece of the target musical piece in the piece sequence, and obtaining a contiguous dance piece set of the target dance piece in the dance action.
(1.4) The computer device performs a sequence search on the adjacent music piece in the set of adjacent dance pieces, determining as the matching dance piece the adjacent dance piece having the smallest sum of the music matching function value with the adjacent music piece and the action matching function value with the target dance piece.
Illustratively, the cloud execution server 430 may obtain each adjacent dance segment in the adjacent dance segment set and, based on the music rhythm prediction module F_m and the action rhythm prediction module F_d of the rhythm prediction model, obtain the music matching function value between each adjacent dance segment and the adjacent music segment.

Further, the cloud execution server 430 may obtain the action matching function value between each adjacent dance segment and the dance segment D_1 matched with the target music segment, based on the Euclidean distance between the transition frames corresponding to each adjacent dance segment and D_1.

Further, the cloud execution server 430 may determine, among the adjacent dance segments in the set, the adjacent dance segment whose sum of the music matching function value with the adjacent music segment and the action matching function value with the target dance segment is smallest as the matching dance segment, i.e., D_2.
And (1.5) updating the matched dancing segments into target dancing segments by the computer equipment, updating the adjacent music segments into target music segments, and iteratively returning to the step of acquiring the adjacent music segments of the target music segments in the segment sequence until each music segment in the segment sequence is subjected to sequence search to obtain the dancing segments matched with each music segment.
Illustratively, following the above process of finding a matching dance segment, the cloud execution server 430 may in turn find the dance segment D_3 matched with the music segment M_3, the dance segment D_4 matched with the music segment M_4, and so on, thereby obtaining the dance segment matched with each music segment (a code sketch of this search procedure is given after step (2) below).
(2) The computer equipment obtains a dance action sequence corresponding to the target music based on each dance segment.
For example, the cloud execution server 430 may obtain a dance motion sequence corresponding to the target music based on each dance segment.
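The greedy search of steps (1.1)–(1.5) can be sketched as follows, assuming the dance action graph is given as adjacency lists and that music_cost and motion_cost implement the music matching function and the action matching function described above (all names here are illustrative):

```python
def search_dance_sequence(music_segments, all_dances, neighbors, music_cost, motion_cost):
    """Greedy sequence search: pick the dance segment with the lowest combined cost
    for each music segment in turn.

    neighbors: dict mapping a dance segment id to its adjacent segment ids in the graph.
    music_cost(m, d): music matching function value between a music and a dance segment.
    motion_cost(d_prev, d): action matching function value (transition-frame distance).
    """
    # initial dance segment: smallest music matching value with the first music segment
    current = min(all_dances, key=lambda d: music_cost(music_segments[0], d))
    sequence = [current]
    for m in music_segments[1:]:
        candidates = neighbors[current]
        current = min(candidates,
                      key=lambda d: music_cost(m, d) + motion_cost(sequence[-1], d))
        sequence.append(current)
    return sequence
```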
Step 280: and the computer equipment generates a plurality of transition sequences corresponding to the dance motion sequences according to the key frames corresponding to each pair of adjacent dance segments in the dance motion sequences.
In the embodiment of the application, the transition sequence generated based on the dance motion sequence can be subjected to motion completion, so that the motions between the adjacent key frames in the completed dance motion sequence can be naturally linked, and the quality of dance video synthesis is effectively improved.
In one embodiment, the step of generating, by the computer device, a plurality of transition sequences corresponding to the dance motion sequence according to key frames corresponding to each pair of adjacent dance segments in the dance motion sequence may include:
(1) The computer equipment obtains a first key frame of the first dance segment from the dance action sequence, and takes the first key frame as a first frame.
(2) The computer device obtains a second key frame of the second dance segment from the dance action sequence, and takes the second key frame as an end frame.
The first dance segment is the preceding dance segment adjacent to the second dance segment. The positions of the first key frame and the second key frame in their respective dance segments are symmetrical within the dance action sequence.
Illustratively, the cloud execution server 430 may obtain the 5th-from-last frame of the first dance segment from the dance action sequence and take it as the first frame, and may obtain the 5th frame of the second dance segment from the dance action sequence and take it as the end frame.
(3) The computer equipment performs linear interpolation calculation based on the first frame and the last frame to obtain a plurality of intermediate frames.
Illustratively, the cloud execution server 430 may obtain different intermediate frames by adjusting the linear interpolation parameter; specifically, 8 intermediate frames are calculated according to the linear interpolation formula.
(4) And the computer equipment determines a transition sequence corresponding to the first dance segment and the second dance segment according to the first frame, the last frame and each intermediate frame so as to obtain a plurality of transition sequences corresponding to the dance action sequence.
For example, the cloud execution server 430 may determine, for every two adjacent dance segments in the dance motion sequence, a corresponding first frame, a corresponding last frame, and each intermediate frame, and form a transition sequence from the corresponding first frame, last frame, and each intermediate frame, so that a plurality of transition sequences corresponding to the dance motion sequence may be obtained.
Step 290: and the computer equipment inputs each transition sequence into the action completion model to complete the action, so as to obtain the dance video corresponding to the target music.
For example, the cloud execution server 430 may input each transition sequence to the action completion model to perform action completion. Specifically, it may input a transition sequence to the action completion model, perform an embedding operation on the transition sequence to obtain an embedding vector, perform feature extraction on the embedding vector to obtain the corresponding feature vector, and perform convolution calculation on the feature vector to obtain a completion sequence. Further, the key frames at the corresponding positions of the transition sequence in the dance action sequence are replaced with the key frames in the completion sequence, yielding the completed dance action sequence, i.e., the dance video corresponding to the target music.
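A hedged sketch of this completion-and-replacement step, assuming the dance action sequence is a single tensor of per-frame poses and that each transition window is 10 frames long, starting at the 5th-from-last frame of the leading dance segment (the index bookkeeping here is an assumption for illustration):

```python
import torch

def complete_dance_sequence(model, dance_sequence, window_starts, device="cuda"):
    """Run each transition sequence through the action completion model and write the
    completed frames back into the dance action sequence.

    dance_sequence: tensor of shape (total_frames, pose_dim) for the whole dance.
    window_starts: list of start indices of each 10-frame transition window.
    """
    model.eval().to(device)
    with torch.no_grad():
        for start in window_starts:
            window = dance_sequence[start:start + 10].unsqueeze(0).to(device)  # transition sequence
            completed = model(window).squeeze(0).cpu()                          # completion sequence
            dance_sequence[start:start + 10] = completed    # replace frames at the same positions
    return dance_sequence
```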
Further, the cloud executing server 430 may send the dance video corresponding to the generated target music to the dance creative service software 421 on the computer 420 through the network, so that the user views the dance video.
In the embodiment of the application, a sample dance segment set can be obtained, wherein the sample dance segment set comprises a plurality of pairs of sample dance segments, each pair of sample dance segments comprises a first sample segment and a second sample segment, linear interpolation calculation is performed on the basis of the first sample segment and the second sample segment, a sample action sequence corresponding to each pair of sample dance segments is determined, further, the sample action sequence is input into a preset action complement network to perform action complement, the action complement network outputs a predicted action sequence corresponding to the sample action sequence, and the action distance loss of the action complement network is determined according to the Euclidean distance of the predicted action sequence and the label action sequence, and further, iterative training is performed on the action complement network based on the action distance loss until the action complement network meets preset conditions, so that an action complement model is obtained.
When the segment sequence of the target music is obtained, the segment sequence can be searched based on the dance action diagram to obtain a dance action sequence composed of dance segments matched with each music segment, and a plurality of transition sequences corresponding to the dance action sequence are generated according to key frames corresponding to each pair of adjacent dance segments in the dance action sequence, and then each transition sequence is input into the action complement model to carry out action complement, so that the dance video corresponding to the target music is obtained. Therefore, the transition sequences of each pair of adjacent dance segments in the dance motion sequence are subjected to motion complement through the motion complement model, so that the motions between the mutually adjacent key frames in the transition sequences are complemented, the motion connection between the dance segments in the dance motion sequence is more accurate and natural, the splicing sense of the integral dance motion in the dance video is reduced, and the synthetic quality of the dance video is improved.
Referring to fig. 11, a block diagram of a video generating apparatus 500 according to an embodiment of the application is shown. The video generating apparatus 500 includes: a segment acquisition module 510, configured to acquire a segment sequence of target music, where the segment sequence includes a plurality of music segments; the sequence search module 520 is configured to perform a sequence search on the segment sequence based on the dance action map to obtain a dance action sequence composed of dance segments matched with each music segment; the sequence generating module 530 is configured to generate a plurality of transition sequences corresponding to the dance motion sequence according to the key frames corresponding to each pair of adjacent dance segments in the dance motion sequence; the video generating module 540 is configured to input each transition sequence to the motion complement model to perform motion complement, so as to obtain a dance video corresponding to the target music; the action completion model is obtained based on action distance loss training, and the action distance loss is determined by Euclidean distance of a predicted action sequence and a label action sequence.
In some embodiments, the sequence generation module 530 may be specifically configured to: acquiring a first key frame of a first dance segment from a dance action sequence, and taking the first key frame as a first frame; acquiring a second key frame of a second dance segment from the dance action sequence, taking the second key frame as an end frame, and enabling the first dance segment to be a preface dance segment adjacent to the second dance segment; performing linear interpolation calculation based on the first frame and the last frame to obtain a plurality of intermediate frames; and determining a transition sequence corresponding to the first dance segment and the second dance segment according to the first frame, the last frame and each intermediate frame so as to obtain a plurality of transition sequences corresponding to the dance motion sequence.
In some embodiments, the video generating apparatus 500 may further include: the system comprises a first sample acquisition module, a second sample acquisition module and a control module, wherein the first sample acquisition module is used for acquiring a sample dance fragment set, the sample dance fragment set comprises a plurality of pairs of sample dance fragments, and each pair of sample dance fragments comprises a first sample fragment and a second sample fragment; the linear interpolation calculation module is used for carrying out linear interpolation calculation based on the first sample segment and the second sample segment and determining a sample action sequence corresponding to each pair of sample dancing segments; the dance motion completion module is used for inputting the sample motion sequence into a preset motion completion network to complete the motion, and the motion completion network outputs a predicted motion sequence corresponding to the sample motion sequence; the first loss determining module is used for determining the loss of the action distance of the action completion network according to the Euclidean distance of the predicted action sequence and the label action sequence; and the action completion training module is used for carrying out iterative training on the action completion network based on the action distance loss until the action completion network meets the preset condition to obtain an action completion model.
In some embodiments, the linear interpolation computation module may include: the first-bit frame acquisition unit is used for acquiring a first sample key frame of the first sample fragment and taking the first sample key frame as a sample first-bit frame; the end frame acquisition unit is used for acquiring a second sample key frame of a second sample segment, taking the second sample key frame as a sample end frame, and the first sample segment is a preamble dance segment adjacent to the second sample segment; the intermediate frame acquisition unit is used for carrying out linear interpolation calculation based on the first sample frame and the last sample frame to obtain a plurality of sample intermediate frames; and the action sequence determining unit is used for determining a sample action sequence corresponding to each pair of sample dance fragments according to the sample first frame, the sample last frame and each sample middle frame.
In some embodiments, the dance completion module may include: the embedding subunit is used for inputting the sample action sequence into a preset action complement network, and performing embedding operation on the sample action sequence to obtain an embedding vector corresponding to the sample action sequence; the extraction subunit is used for extracting the characteristics of the embedded vector to obtain a characteristic vector corresponding to the embedded vector; and the convolution subunit is used for carrying out convolution calculation on the feature vector to obtain a predicted action sequence.
In some embodiments, the embedding sub-unit may be specifically configured to: inputting a sample action sequence into a convolution layer of a preset action completion network, and calculating to obtain a segmented representation; inputting a sample action sequence to a first embedded layer of a preset action completion network, and calculating to obtain gesture expression; inputting a sample action sequence into a second embedded layer of a preset action completion network, and calculating to obtain a key frame representation; and obtaining an embedded vector corresponding to the sample action sequence based on the segment representation, the gesture representation and the key frame representation.
In some embodiments, the sequence search module 520 may include: the sequence searching unit is used for sequentially searching each music segment in the segment sequence according to the searching function in the dance action drawing to obtain a dance segment matched with each music segment; the search function comprises at least one of a music matching function or an action matching function, the music matching function is determined by a rhythm prediction model, and the action matching function is determined by Euclidean distances of two transition frames corresponding to two adjacent dance segments; and the sequence generating unit is used for obtaining a dance action sequence corresponding to the target music based on each dance segment.
In some embodiments, the sequence search unit may include: a searching subunit, configured to perform a sequence search on an initial music segment in the segment sequence in the dance action diagram, and determine a dance segment with a smallest music matching function value with the initial music segment as an initial dance segment; an initial subunit, configured to take an initial music segment as a target music segment, and take an initial dance segment as a target dance segment; an acquisition subunit, configured to acquire an adjacent music segment of the target music segment in the segment sequence, and acquire an adjacent dance segment set of the target dance segment in the dance action map; a calculation subunit for performing a sequence search on adjacent music pieces in the set of adjacent dance pieces, and determining an adjacent dance piece having a smallest sum of a music matching function value with the adjacent music pieces and an action matching function value with the target dance piece as a matching dance piece; and the iteration subunit is used for updating the matched dancing segment into a target dancing segment, updating the adjacent music segments into target music segments, and iteratively returning to the step of acquiring the adjacent music segments of the target music segments in the segment sequence until each music segment in the segment sequence finishes the sequence search, so as to obtain the dancing segment matched with each music segment.
In some embodiments, the computing subunit may be specifically configured to: acquiring each adjacent dance segment in the adjacent dance segment set; a music rhythm prediction module and an action rhythm prediction module based on the rhythm prediction model obtain a music matching function value between each adjacent dance segment and each adjacent music segment; and obtaining an action matching function value between each adjacent dance segment and the dance segment matched with the target music segment based on the Euclidean distance of the transition frame corresponding to each dance segment matched with the target music segment.
In some embodiments, the video generating apparatus 500 may further include: the third sample acquisition module is used for acquiring a first playing training set, wherein the first playing training set comprises sample music fragments and first tag rhythm marks; the fourth sample acquisition module is used for acquiring a second rhythm training set, and the second rhythm training set comprises a sample action segment and a second label rhythm mark; the system comprises a preset network acquisition module, a music rhythm prediction module and a music rhythm prediction module, wherein the preset network acquisition module is used for acquiring a preset rhythm prediction network, and the rhythm prediction network comprises an action rhythm prediction module and the music rhythm prediction module, wherein a classification layer of the action rhythm prediction module and a classification layer of the music rhythm prediction module share weight parameters; the preset network training module is used for carrying out iterative training on the rhythm prediction network based on the sample music piece, the first label rhythm mark, the sample action piece and the second label rhythm mark until the rhythm prediction network meets preset conditions, and a rhythm prediction model is obtained.
In some embodiments, the preset network training module may include: the music mark prediction unit is used for inputting the sample music piece into the music rhythm prediction module to predict the music rhythm so as to obtain a predicted music rhythm mark; the action sign prediction unit is used for inputting the sample action segment into the action rhythm prediction module to predict the action rhythm, so as to obtain a predicted action rhythm sign; a target loss determination unit, configured to determine a target loss corresponding to the rhythm prediction network based on the predicted music tempo token, the predicted action tempo token, the first tag tempo token, and the second tag tempo token; and the iterative training unit is used for carrying out iterative training on the rhythm prediction network according to the target loss until the rhythm prediction network meets the preset condition.
In some embodiments, the target loss determination unit may be specifically configured to: determining a loss of music tempo based on a hamming distance between the predicted music tempo token and the first tag tempo token; determining an action tempo penalty based on a hamming distance between the predicted action tempo token and the second label tempo token; and determining the target loss corresponding to the playing prediction network according to the music rhythm loss and the action rhythm loss.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the apparatus and modules described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.
In several embodiments provided by the present application, the coupling of the modules to each other may be electrical, mechanical, or other.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
According to the scheme provided by the application, the segment sequence of the target music can be obtained, wherein the segment sequence comprises a plurality of music segments, the segment sequence is searched based on the dance action diagram to obtain the dance action sequence consisting of dance segments matched with each music segment, and further, a plurality of transition sequences corresponding to the dance action sequence are generated according to key frames corresponding to each pair of adjacent dance segments in the dance action sequence. And inputting each transition sequence into the action completion model to perform action completion, so as to obtain the dance video corresponding to the target music. And the transition sequences of each pair of adjacent dance segments in the dance motion sequence are subjected to motion complement through the motion complement model, so that the motions between the mutually adjacent key frames in the transition sequences are complemented, the motion connection between the dance segments in the dance motion sequence is more accurate and natural, the splicing sense of the integral dance motion in the dance video is reduced, and the integral fitness of the dance motion is improved.
As shown in fig. 12, an embodiment of the present application further provides a computer apparatus 600, where the computer apparatus 600 includes a processor 610, a memory 620, a power source 630, and an input unit 640, and the memory 620 stores a computer program, and when the computer program is called by the processor 610, the computer program can implement the various method steps provided in the above embodiments. It will be appreciated by those skilled in the art that the structure of the computer device shown in the drawings does not constitute a limitation of the computer device, and may include more or less components than those illustrated, or may combine certain components, or may be arranged in different components. Wherein:
processor 610 may include one or more processing cores. The processor 610 connects various parts of the overall computer device 600 using various interfaces and lines, and controls the computer device as a whole by running or executing the instructions, programs, instruction sets, or program sets stored in the memory 620 and invoking the data stored in the memory 620, so as to perform the various functions of the computer device and process data. Alternatively, the processor 610 may be implemented in hardware in at least one of the forms of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), or programmable logic array (Programmable Logic Array, PLA). The processor 610 may integrate one of, or a combination of, a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem is used to handle wireless communication. It will be appreciated that the modem may also not be integrated into the processor 610 and may instead be implemented by a separate communication chip.
The memory 620 may include random access memory (Random Access Memory, RAM) or read-only memory (Read-Only Memory, ROM). The memory 620 may be used to store instructions, programs, sets of instructions, or program sets. The memory 620 may include a stored program area and a stored data area, wherein the stored program area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, and the like. The stored data area may also store data created by the computer device in use, such as phonebook and audio and video data, and the like. Accordingly, the memory 620 may also include a memory controller to provide the processor 610 with access to the memory 620.
The power supply 630 may be logically connected to the processor 610 through a power management system, so that functions of managing charging, discharging, and power consumption management are implemented through the power management system. The power supply 630 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
An input unit 640, the input unit 640 being operable to receive input numeric or character information and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
Although not shown, the computer device 600 may further include a display unit and the like, which are not described herein. Specifically, in this embodiment, the processor 610 in the computer device loads the executable files corresponding to the processes of one or more computer programs into the memory 620, and the processor 610 runs the computer programs stored in the memory 620, so as to implement the various method steps provided in the foregoing embodiments.
As shown in fig. 13, an embodiment of the present application further provides a computer readable storage medium 700, where the computer readable storage medium 700 stores a computer program 710, and the computer program 710 may be called by a processor to perform various method steps provided by the embodiment of the present application.
The computer readable storage medium may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium comprises a Non-volatile computer readable storage medium (Non-Transitory Computer-Readable Storage Medium). The computer readable storage medium 700 has storage space for a computer program that performs any of the method steps in the embodiments described above. These computer programs may be read from or written to one or more computer program products. The computer program can be compressed in a suitable form.
According to one aspect of the present application, there is provided a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program so that the computer device performs the various method steps provided by the above embodiments.
Although the present application has been described in terms of the preferred embodiments, it should be understood that the present application is not limited to the specific embodiments, but is capable of numerous modifications and equivalents, and alternative embodiments and modifications of the embodiments described above, without departing from the spirit and scope of the present application.

Claims (16)

1. A method of video generation, the method comprising:
acquiring a segment sequence of target music, wherein the segment sequence comprises a plurality of music segments;
Performing sequence search on the segment sequences based on the dance action graph to obtain dance action sequences composed of dance segments matched with each music segment;
generating a plurality of transition sequences corresponding to the dance action sequences according to key frames corresponding to each pair of adjacent dance segments in the dance action sequences;
inputting each transition sequence into an action complement model to carry out action complement, and obtaining a dance video corresponding to the target music;
the action completion model is obtained based on action distance loss training, and the action distance loss is determined by Euclidean distance of a predicted action sequence and a label action sequence.
2. The method of claim 1, wherein generating a plurality of transition sequences corresponding to the dance motion sequence from key frames corresponding to each pair of adjacent dance segments in the dance motion sequence, comprises:
acquiring a first key frame of a first dance segment from the dance action sequence, and taking the first key frame as a first frame;
acquiring a second key frame of a second dance segment from the dance action sequence, taking the second key frame as an end frame, wherein the first dance segment is a preface dance segment adjacent to the second dance segment;
Performing linear interpolation calculation based on the first frame and the last frame to obtain a plurality of intermediate frames;
and determining a transition sequence corresponding to the first dance segment and the second dance segment according to the first frame, the last frame and each intermediate frame so as to obtain a plurality of transition sequences corresponding to the dance action sequence.
3. The method of claim 1, wherein the action complement model is trained by:
obtaining a sample dance segment set, wherein the sample dance segment set comprises a plurality of pairs of sample dance segments, and each pair of sample dance segments comprises a first sample segment and a second sample segment;
performing linear interpolation calculation based on the first sample segment and the second sample segment, and determining a sample action sequence corresponding to each pair of sample dance segments;
inputting the sample action sequence into a preset action complement network for action complement, wherein the action complement network outputs a predicted action sequence corresponding to the sample action sequence;
determining the action distance loss of the action completion network according to the Euclidean distance of the predicted action sequence and the label action sequence;
And performing iterative training on the action completion network based on the action distance loss until the action completion network meets the preset condition to obtain an action completion model.
4. The method of claim 3, wherein the determining the sample action sequence corresponding to each pair of sample dancing segments based on the linear interpolation calculation of the first sample segment and the second sample segment comprises:
acquiring a first sample key frame of the first sample fragment, and taking the first sample key frame as a sample first frame;
acquiring a second sample key frame of the second sample segment, taking the second sample key frame as a sample tail frame, and taking the first sample segment as a preamble dance segment adjacent to the second sample segment;
performing linear interpolation calculation based on the sample first frame and the sample last frame to obtain a plurality of sample intermediate frames;
and determining a sample action sequence corresponding to each pair of sample dance fragments according to the sample first frame, the sample last frame and each sample middle frame.
5. The method according to claim 3 or 4, wherein the inputting the sample motion sequence into a preset motion complement network for motion complement, and the motion complement network outputs a predicted motion sequence corresponding to the sample motion sequence, includes:
Inputting the sample action sequence to a preset action complement network, and performing embedding operation on the sample action sequence to obtain an embedding vector corresponding to the sample action sequence;
extracting features of the embedded vectors to obtain feature vectors corresponding to the embedded vectors;
and carrying out convolution calculation on the characteristic vector to obtain a predicted action sequence.
6. The method of claim 5, wherein the inputting the sample motion sequence into a preset motion completion network performs an embedding operation on the sample motion sequence to obtain an embedding vector corresponding to the sample motion sequence, and includes:
inputting the sample action sequence to a convolution layer of a preset action completion network, and calculating to obtain a segmented representation;
inputting the sample action sequence to a first embedded layer of a preset action completion network, and calculating to obtain gesture expression;
inputting the sample action sequence to a second embedded layer of a preset action completion network, and calculating to obtain a key frame representation;
and obtaining an embedded vector corresponding to the sample action sequence based on the segment representation, the gesture representation and the key frame representation.
7. The method of claim 1, wherein the performing a sequence search on the segment sequence based on dance action map to obtain a dance action sequence composed of dance segments matched with each of the music segments comprises:
in dance action drawings, sequentially searching each music segment in the segment sequence according to a search function to obtain dance segments matched with each music segment;
the search function comprises at least one of a music matching function or an action matching function, wherein the music matching function is determined by a rhythm prediction model, and the action matching function is determined by Euclidean distances of two transition frames corresponding to two adjacent dance segments;
and obtaining a dance action sequence corresponding to the target music based on each dance segment.
8. The method of claim 7, wherein in the dance action drawing, sequentially searching each music piece in the sequence of pieces according to a search function to obtain a dance piece matched with each music piece, comprising:
in the dance action diagram, performing sequence search on an initial music segment in the segment sequence, and determining a dance segment with the smallest music matching function value with the initial music segment as an initial dance segment;
Taking the initial music segment as a target music segment and taking the initial dance segment as a target dance segment;
acquiring adjacent music fragments of the target music fragment in the fragment sequence, and acquiring an adjacent dance fragment set of the target dance fragment in the dance action diagram;
performing sequence search on the adjacent music fragments in the adjacent dance fragment set, and determining the adjacent dance fragments with the smallest sum of the music matching function values of the adjacent music fragments and the action matching function values of the target dance fragments as matched dance fragments;
updating the matched dancing segments into target dancing segments, updating the adjacent music segments into target music segments, and iteratively returning to execute the step of acquiring the adjacent music segments of the target music segments in the segment sequence until each music segment in the segment sequence is subjected to sequence search to obtain the dancing segments matched with each music segment.
9. The method of claim 8, wherein the performing a sequence search on the contiguous musical piece in the contiguous set of dance segments comprises:
Acquiring each adjacent dance segment in the adjacent dance segment set;
a music rhythm prediction module and an action rhythm prediction module based on a rhythm prediction model obtain a music matching function value between each adjacent dance segment and each adjacent music segment;
and obtaining an action matching function value between each adjacent dance segment and the dance segment matched with the target music segment based on Euclidean distance of the transition frame corresponding to each dance segment matched with the target music segment.
10. The method according to any one of claims 7 to 9, wherein the cadence prediction model is trained by:
acquiring a first rhythm training set, wherein the first rhythm training set comprises a sample music piece and a first label rhythm mark;
acquiring a second rhythm training set, wherein the second rhythm training set comprises a sample action segment and a second label rhythm mark;
acquiring a preset rhythm prediction network, wherein the rhythm prediction network comprises an action rhythm prediction module and a music rhythm prediction module, and a classification layer of the action rhythm prediction module and a classification layer of the music rhythm prediction module share weight parameters;
And performing iterative training on the rhythm prediction network based on the sample music piece, the first tag rhythm mark, the sample action piece and the second tag rhythm mark until the rhythm prediction network meets the preset condition, so as to obtain a rhythm prediction model.
11. The method of claim 10, wherein the iteratively training the tempo prediction network based on the sample piece of music, the first tagged tempo token, the sample action piece and the second tagged tempo token until the tempo prediction network meets a preset condition includes:
inputting the sample music piece into the music rhythm prediction module to predict the music rhythm, so as to obtain a predicted music rhythm mark;
inputting the sample action segment into the action rhythm prediction module to predict the action rhythm, so as to obtain a predicted action rhythm mark;
determining a target loss corresponding to the rhythm prediction network based on the predicted musical tempo token, the predicted action tempo token, the first tag tempo token, and the second tag tempo token;
and carrying out iterative training on the rhythm prediction network according to the target loss until the rhythm prediction network meets the preset condition.
12. The method of claim 9, wherein the determining the target loss for the rhythm prediction network based on the predicted musical tempo token, the predicted action tempo token, the first label tempo token, and the second label tempo token comprises:
determining a music tempo penalty based on a hamming distance between the predicted music tempo token and the first tag tempo token;
determining an action tempo penalty based on a hamming distance between the predicted action tempo token and the second label tempo token;
and determining the target loss corresponding to the rhythm prediction network according to the music rhythm loss and the action rhythm loss.
13. A video generating apparatus, the apparatus comprising:
the system comprises a segment acquisition module, a segment generation module and a segment generation module, wherein the segment acquisition module is used for acquiring a segment sequence of target music, and the segment sequence comprises a plurality of music segments;
the sequence searching module is used for carrying out sequence searching on the segment sequences based on the dance action drawing to obtain a dance action sequence composed of dance segments matched with each music segment;
the sequence generation module is used for generating a plurality of transition sequences corresponding to the dance motion sequences according to key frames corresponding to each pair of adjacent dance segments in the dance motion sequences;
The video generation module is used for inputting each transition sequence into the action completion model to carry out action completion, so as to obtain dance videos corresponding to the target music;
the action completion model is obtained based on action distance loss training, and the action distance loss is determined by Euclidean distance of a predicted action sequence and a label action sequence.
14. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which is callable by a processor for executing the method according to any one of claims 1 to 12.
15. A computer device, comprising:
a memory;
one or more processors coupled with the memory;
one or more computer programs, wherein the one or more computer programs are stored in the memory and configured to be executed by the one or more processors, the one or more computer programs configured to perform the method of any of claims 1-12.
16. A computer program product, characterized in that the computer program product comprises a computer program, the computer program being stored in a storage medium; a processor of a computer device reads the computer program from a storage medium, the processor executing the computer program, causing the computer device to perform the method of any of claims 1-12.
CN202211655729.3A 2022-12-21 2022-12-21 Video generation method, device, storage medium and computer equipment Pending CN116980543A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211655729.3A CN116980543A (en) 2022-12-21 2022-12-21 Video generation method, device, storage medium and computer equipment

Publications (1)

Publication Number Publication Date
CN116980543A true CN116980543A (en) 2023-10-31

Family

ID=88473722

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211655729.3A Pending CN116980543A (en) 2022-12-21 2022-12-21 Video generation method, device, storage medium and computer equipment

Country Status (1)

Country Link
CN (1) CN116980543A (en)


Legal Events

Date Code Title Description
PB01 Publication