CN115712739A - Dance action generation method, computer device and storage medium - Google Patents
- Publication number
- CN115712739A (application number CN202211441749.0A)
- Authority
- CN
- China
- Prior art keywords
- audio
- sample
- action
- dance
- motion
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/40—Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Processing Or Creating Images (AREA)
Abstract
The embodiment of the application discloses a dance action generation method, computer equipment, and a storage medium, wherein the method comprises the following steps: acquiring audio to be danced, and extracting a plurality of audio segments from the audio to be danced; inputting the plurality of audio segments into a pre-trained coding model to obtain a first action characteristic of each audio segment in the plurality of audio segments; according to the first action characteristic, determining a second action characteristic similar to the first action characteristic from a plurality of dance action characteristics stored in an action library in advance; and inputting the second action characteristic corresponding to each audio segment into a pre-trained decoding model to obtain a third action characteristic, and determining the dance action of the audio to be danced according to the third action characteristic. Through the method, dance actions can be automatically generated, the requirements of users on automated and intelligent dance action generation are met, and the quality of the dance actions is improved.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a dance action generation method, a computer device, and a storage medium.
Background
Dance is a high-level art form capable of conveying emotion, composed of audio and dance actions. How the audio and the dance actions are matched is the key point and the difficulty of choreography. Dance actions can be choreographed by professional dancers according to their understanding of the emotion of the audio; however, this mode of choreography depends on professional dancers, and ordinary users cannot complete the choreography by themselves. Therefore, how to effectively implement automated choreography is very important.
Disclosure of Invention
The embodiment of the application provides a dance action generation method, computer equipment, and a storage medium, which can automatically generate dance actions and improve the quality of the generated dance actions.
In a first aspect, an embodiment of the present application provides a dance action generating method, including:
acquiring audio to be danced, and extracting a plurality of audio segments from the audio to be danced; inputting the plurality of audio segments into a pre-trained coding model to obtain a first action characteristic of each audio segment in the plurality of audio segments, wherein the coding model is obtained by training sample audio and a sample dance action corresponding to the sample audio;
according to the first action characteristic of each audio clip, determining a second action characteristic similar to the first action characteristic of each audio clip from action characteristics of a plurality of dance actions stored in an action library in advance;
inputting the second motion characteristics corresponding to each audio segment into a pre-trained decoding model to obtain third motion characteristics, and determining dance motions of the audio to be danced according to the third motion characteristics, wherein the decoding model is obtained by training sample motion characteristics corresponding to the sample audio and the sample dance motions corresponding to the sample audio, the sample motion characteristics corresponding to the sample audio are obtained by encoding the sample audio by using the pre-trained encoding model, and the third motion characteristics are used for indicating the motion characteristics of all the audio segments.
In a second aspect, an embodiment of the present application provides a dance motion generating apparatus, including:
an acquisition unit, configured to acquire audio to be danced and extract a plurality of audio segments from the audio to be danced;
the encoding unit is used for inputting the plurality of audio segments into a pre-trained encoding model to obtain a first action characteristic of each audio segment in the plurality of audio segments, wherein the encoding model is obtained by training a sample audio and a sample dance action corresponding to the sample audio;
the determining unit is used for determining a second action characteristic similar to the first action characteristic of each audio clip from the action characteristics of a plurality of dance actions stored in an action library in advance according to the first action characteristic of each audio clip;
and the decoding unit is used for inputting the second motion characteristics corresponding to each audio segment into a pre-trained decoding model to obtain third motion characteristics, and determining the dance motion of the audio to be danced according to the third motion characteristics, wherein the decoding model is obtained by training sample motion characteristics corresponding to the sample audio and the sample dance motion corresponding to the sample audio, the sample motion characteristics corresponding to the sample audio are obtained by encoding the sample audio by using the pre-trained encoding model, and the third motion characteristics are used for indicating the motion characteristics of all the audio segments.
In a third aspect, an embodiment of the present application provides a computer device, where the computer device includes: a processor and a memory, the processor to perform:
acquiring audio to be danced, and extracting a plurality of audio segments from the audio to be danced;
inputting the plurality of audio segments into a pre-trained coding model to obtain a first action characteristic of each audio segment in the plurality of audio segments, wherein the coding model is obtained by training a sample audio and a sample dance action corresponding to the sample audio;
according to the first action characteristic of each audio clip, determining a second action characteristic similar to the first action characteristic of each audio clip from action characteristics of a plurality of dance actions stored in an action library in advance;
inputting the second motion characteristics corresponding to each audio segment into a pre-trained decoding model to obtain third motion characteristics, and determining dance motions of the audio to be danced according to the third motion characteristics, wherein the decoding model is obtained by training sample motion characteristics corresponding to the sample audio and the sample dance motions corresponding to the sample audio, the sample motion characteristics corresponding to the sample audio are obtained by encoding the sample audio by using the pre-trained encoding model, and the third motion characteristics are used for indicating the motion characteristics of all the audio segments.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium, where program instructions are stored, and when the program instructions are executed, the computer-readable storage medium is configured to implement the method described in the first aspect.
The method and the device can acquire the audio to be danced and extract a plurality of audio segments from the audio to be danced; input the plurality of audio segments into a pre-trained coding model to obtain a first action characteristic of each audio segment in the plurality of audio segments, wherein the coding model is obtained by training sample audio and sample dance actions corresponding to the sample audio; according to the first action characteristic of each audio segment, determine a second action characteristic similar to the first action characteristic of each audio segment from the action characteristics of a plurality of dance actions stored in an action library in advance; and input the second motion characteristics corresponding to each audio segment into a pre-trained decoding model to obtain third motion characteristics, and determine the dance motions of the audio to be danced according to the third motion characteristics, wherein the decoding model is obtained by training sample motion characteristics corresponding to the sample audio and sample dance motions corresponding to the sample audio, the sample motion characteristics corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained coding model, and the third motion characteristics are used for indicating the motion characteristics of all the audio segments. Through the method, dance motions can be automatically generated, the requirements of users on automated and intelligent dance motion generation are met, and the quality of dance motions is improved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic flow chart of a dance action generating method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a multi-frame dance motion;
FIG. 3 is a schematic diagram of a human body key point;
FIG. 4 is a flowchart illustrating another dance action generating method according to the embodiment of the application;
FIG. 5 is a schematic flow chart of yet another dance motion generation method provided in the embodiments of the present application;
FIG. 6 is a schematic flow chart of yet another dance motion generation method provided in the embodiments of the present application;
FIG. 7 is a schematic structural diagram of a dance motion generating apparatus according to an embodiment of the present application;
fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below clearly with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human Intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
The artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The artificial intelligence infrastructure generally includes technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operating/interactive systems, and mechatronics. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
The key technologies of Speech Technology are automatic speech recognition (ASR), text-to-speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the development direction of future human-computer interaction, and speech is becoming one of the most promising modes of human-computer interaction.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science integrating linguistics, computer science, and mathematics; research in this field involves natural language, i.e., the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graphs, and the like.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors so as to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning/deep learning generally includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and teaching-based learning.
Based on technologies such as the machine learning mentioned above, the embodiment of the application provides a dance action generation scheme: a pre-trained coding model is used to encode the audio to be danced to obtain first action characteristics of the audio to be danced, second action characteristics similar to the first action characteristics are determined from an action library obtained through model learning, and the second action characteristics are then decoded by a decoding model to generate the dance actions of the audio to be danced. In this way, dance actions can be generated automatically, and the quality of the generated dance actions is improved.
The dance action generation method provided by the embodiment of the application can be applied to a dance action generation device, the dance action generation device can be arranged in computer equipment, and in some embodiments, the computer equipment can include but is not limited to intelligent terminal equipment such as a smart phone, a tablet computer, a notebook computer, a desktop computer, a vehicle-mounted intelligent terminal and a smart watch.
In some embodiments, the dance action generation method provided in the embodiments of the present application may be applied to scenes of dance choreography: for example, generating a dance action matched with the audio to be danced according to the audio to be danced, and the like. Of course, the above application scenarios are merely illustrative, and in other embodiments, dance motion generation of embodiments of the present application may be applied to any scenario associated with dance motion generation.
The dance action generation method provided by the embodiment of the application is schematically described below with reference to the drawings.
Specifically, referring to fig. 1, fig. 1 is a schematic flow chart of a dance action generating method provided in an embodiment of the present application, where the dance action generating method in the embodiment of the present application may be executed by a dance action generating device, where the dance action generating device is disposed in a terminal or a computer device, and a specific explanation of the terminal or the computer device is as described above. Specifically, the method of the embodiment of the present application includes the following steps.
S101: obtaining audio to be danced, and extracting a plurality of audio segments from the audio to be danced.
In the embodiment of the application, when extracting a plurality of audio segments from the audio to be danced, the computer device may obtain beat information of the audio to be danced and extract the plurality of audio segments from the audio to be danced according to the beat information.
In one embodiment, when extracting the plurality of audio segments from the audio to be danced according to the beat information, the computer device may extract the audio segments according to a specified beat. For example, the specified beat may be a 1/2 beat, and the computer device may extract the audio segments corresponding to the 1/2 position of each beat in the audio to be danced. In other embodiments, the specified beat may be another beat, which is not specifically limited in this application.
In the embodiment of the application, extracting a plurality of audio segments from the audio to be danced allows a matched dance action to be generated for each audio segment, which improves the quality of the dance actions generated for the audio to be danced.
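As an illustrative aid only (not part of the claimed method), the beat-based extraction of S101 could be sketched as follows, assuming the librosa library for beat tracking; the function name extract_segments is hypothetical.

```python
import librosa

def extract_segments(audio_path, sr=22050):
    """Split the audio to be danced into beat-aligned segments (sketch)."""
    y, _ = librosa.load(audio_path, sr=sr)
    # Estimate beat positions (frame indices) of the audio.
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_samples = librosa.frames_to_samples(beat_frames)
    # Each segment spans one beat interval; a 1/2-beat scheme would
    # additionally insert a midpoint between consecutive beats.
    return [y[s:e] for s, e in zip(beat_samples[:-1], beat_samples[1:])]
```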
S102: the plurality of audio segments are input into a pre-trained coding model to obtain a first action characteristic of each of the plurality of audio segments.
In this embodiment, the computer device may input the plurality of audio segments into a pre-trained coding model to obtain the first motion characteristic of each of the plurality of audio segments, where the coding model is obtained by training an initial coding model with sample audio and the sample dance motions corresponding to the sample audio. In some embodiments, the data format of the first motion characteristic may include, but is not limited to, a matrix, polygon mesh data (MMD), a common three-dimensional model format (FilmBox, FBX), and the like.
In one embodiment, when the computer device inputs the plurality of audio segments into the pre-trained coding model to obtain the first motion feature of each of the plurality of audio segments, the computer device may input each of the plurality of audio segments into the pre-trained coding model to obtain a motion feature vector corresponding to each of the audio segments, and determine the first motion feature according to the motion feature vector corresponding to each of the audio segments.
Further, when the data form of the first motion feature is a matrix, the computer device may, when determining the first motion feature according to the motion feature vectors corresponding to each audio segment, use the motion feature vector corresponding to each motion in each audio segment as a row vector, so that the first motion feature is composed of the motion feature vectors of the motions in the plurality of audio segments, wherein one audio segment may include one or more motions, and each motion corresponds to one motion feature vector. For example, the first motion characteristic may be a t × 512 matrix, where t is the number of audio segments.
When the data form of the first action feature is a matrix, the number of columns of the matrix may be determined according to the number of actions in the action library, for example, if the number of actions in the action library is 64, then the first action feature matrix may be determined to be a matrix of t × 64.
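For illustration only, composing the first motion feature matrix from per-segment motion feature vectors might look like the following sketch; the coding model's architecture is unspecified in the text, so the encoder is assumed to be a generic PyTorch module, and one motion per segment is assumed for simplicity.

```python
import torch

def encode_segments(encoder, segment_features):
    """Stack per-segment encoder outputs into the first motion feature
    matrix of shape (t, d), e.g. t x 512 (sketch only).

    encoder: assumed torch.nn.Module mapping an audio feature vector
             to a motion feature vector.
    segment_features: tensor of shape (t, n_audio_features).
    """
    with torch.no_grad():
        rows = [encoder(seg.unsqueeze(0)).squeeze(0) for seg in segment_features]
    # Each row vector corresponds to one motion of one audio segment.
    return torch.stack(rows)  # shape (t, d)
```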
S103: and according to the first action characteristic of each audio segment, determining a second action characteristic similar to the first action characteristic from the action characteristics of a plurality of dance actions stored in advance in the action library.
In the embodiment of the application, the action library stores in advance the motion characteristics of dance motions of a plurality of dance categories, and the data form of the motion characteristics includes but is not limited to a matrix. For example, the action library may be a T × 24 × 3 matrix, where T indicates T frames of dance motions, 24 indicates 24 key points of the human body, and 3 indicates the three-dimensional coordinate position of each key point. Fig. 2 is a schematic diagram of a multi-frame dance motion; as shown in fig. 2, each individual pose corresponds to one frame of the dance motion, and fig. 2 includes multiple such frames. As shown in fig. 3, fig. 3 is a schematic diagram of human body key points, and the numbers 0-23 marked in fig. 3 indicate the key points of the human body; a dance motion is determined according to the position of each key point, and key points at different positions can form a plurality of dance motions.
In one embodiment, the computer device may acquire a plurality of first motion feature vectors included in the first motion feature of each audio piece when determining a second motion feature similar to the first motion feature from among the motion features of a plurality of dance motions stored in the motion library in advance according to the first motion feature, wherein each first motion feature vector is used for indicating one motion; determining a second action characteristic vector corresponding to each first action characteristic vector from action characteristics of various dance actions stored in an action library in advance according to each first action characteristic vector of each audio clip; and determining a second motion characteristic of each audio segment according to each second motion characteristic vector of each audio segment.
In one embodiment, when determining a second motion feature vector corresponding to each first motion feature vector from motion features of a plurality of dance motions stored in a motion library in advance according to each first motion feature vector of each audio segment, the computer device may obtain a distance between each first motion feature vector of each audio segment and each motion feature vector in the motion library, wherein each motion feature vector in the motion library is used for indicating one dance motion stored in advance; and acquiring a motion characteristic vector with the shortest distance to each first motion characteristic vector of each audio clip from the motion library as a second motion characteristic vector of each audio clip.
In one embodiment, when obtaining the distance between each first motion feature vector of each audio segment and each motion feature vector in the motion library, the computer device may calculate the distance as the Euclidean distance between the two vectors.
In an embodiment, the first action feature is specifically a first action feature matrix, and the computer device may obtain, for each audio segment, each row vector in the first action feature matrix corresponding to the audio segment to obtain a plurality of first action feature vectors corresponding to the audio segment, where each row vector is used to indicate an action; determining a second action characteristic vector corresponding to the first action characteristic vector from action characteristics of various dance actions stored in an action library in advance aiming at each first action characteristic vector of each audio clip to obtain a plurality of second action characteristic vectors corresponding to each audio clip; and for each audio clip, combining a plurality of second motion characteristic vectors corresponding to the audio clip to obtain a second motion characteristic matrix corresponding to the audio clip, wherein the second motion characteristic matrix is used for representing the second motion characteristics.
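A minimal numpy sketch of this nearest-neighbour lookup (S103), assuming the action library is held as a K × d matrix of motion feature vectors; all names are illustrative.

```python
import numpy as np

def nearest_motion_features(first_feats, library):
    """For every first motion feature vector (one row per motion),
    select the library vector with the shortest Euclidean distance.

    first_feats: (t, d) first motion feature matrix.
    library:     (K, d) matrix of K pre-stored dance motion features.
    Returns the (t, d) second motion feature matrix.
    """
    # Pairwise Euclidean distances between rows -> shape (t, K).
    dists = np.linalg.norm(first_feats[:, None, :] - library[None, :, :], axis=-1)
    nearest = dists.argmin(axis=1)   # index of the closest library vector
    return library[nearest]          # second motion feature vectors
```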
According to the method and the device, the second action characteristic corresponding to the first action characteristic is determined from the action library, so that the subsequent decoding of the second action characteristic of each audio segment can be effectively performed according to the decoding model, the third action characteristic is obtained, and the dance action of the audio to be danced is further determined according to the third action characteristic.
S104: and inputting the second action characteristic corresponding to each audio segment into a pre-trained decoding model to obtain a third action characteristic, and determining the dance action of the audio to be danced according to the third action characteristic.
In this embodiment, the computer device may input the second motion characteristic corresponding to each audio segment into a pre-trained decoding model to obtain a third motion characteristic, and determine the dance motion of the audio to be danced according to the third motion characteristic. The decoding model is obtained by training an initial decoding model with the sample motion characteristics corresponding to the sample audio and the sample dance motions corresponding to the sample audio, where the sample motion characteristics are obtained by encoding the sample audio using the pre-trained coding model. The third motion characteristic is used to indicate the motion characteristics of all audio segments, including the number of key points of the human body in all audio segments and the position of each key point. In some embodiments, the third motion characteristic includes but is not limited to a third motion characteristic matrix; for example, the third motion characteristic may be a T × 24 × 3 matrix, where T indicates the number of audio segments, 24 the key points of the human body, and 3 the three-dimensional coordinates of each key point.
When the computer device determines the dance motion of the audio to be danced according to the third motion characteristic, it may determine the number of key points of the human body in all audio segments and the position of each key point according to the third motion characteristic, and then determine the dance motion of the audio to be danced from those key points. For example, assuming the third motion characteristic is a T × 24 × 3 matrix, the computer device may determine from it the number of key points of the human body and the position of each key point in all audio segments, determine the human body motion in all audio segments accordingly, and take the human body motions of all audio segments as the dance motion of the audio to be danced.
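Under the example dimensions given above (a T × 24 × 3 third motion characteristic), the final step of S104 could be read off as in the following sketch; the per-frame dictionary representation is an illustrative assumption.

```python
import numpy as np

def keypoints_from_third_features(third_feats):
    """Interpret the decoder output as per-frame human keypoints:
    shape (T, 24, 3) = T audio segments, 24 human-body key points,
    3-D coordinates per key point (example dimensions from the text).
    """
    frames = []
    for frame in np.asarray(third_feats):          # frame: (24, 3)
        frames.append({k: tuple(coords) for k, coords in enumerate(frame)})
    return frames  # the dance motion of the audio to be danced
```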
In the embodiment of the application, the audio segments of the audio to be danced are encoded by the pre-trained coding model to obtain the corresponding first action characteristics, the second action characteristics corresponding to the first action characteristics are determined from the action characteristics of the dance actions stored in the action library in advance, and the second action characteristics are decoded by the pre-trained decoding model, so that the dance actions of the audio to be danced are generated more accurately and with higher quality.
Referring to fig. 4, fig. 4 is a schematic flow chart of another dance motion generation method provided in the embodiment of the present application, and the dance motion generation method in the embodiment of the present application may be executed by a dance motion generation apparatus, where the dance motion generation apparatus is disposed in a terminal or a computer device, and a specific explanation of the terminal or the computer device is as described above. Specifically, the embodiment of the present application mainly describes a training process of a coding model, and specifically includes the following steps.
S401: obtaining audio to be danced, and extracting a plurality of audio segments from the audio to be danced.
S402: the method comprises the steps of obtaining a sample data set, wherein the sample data set comprises a plurality of sample dance music data, and each sample dance music data comprises sample audio and sample dance actions.
S403: and training the initial coding model according to the sample audio and the sample dance actions of each sample dance music data to obtain a pre-trained coding model.
In one embodiment, when training the initial coding model according to the sample audio and the sample dance motions of each piece of sample dance music data to obtain the pre-trained coding model, the computer device may extract sample motion characteristics from the sample dance motions of each piece of sample dance music data, and input the sample motion characteristics and the sample audio into the initial coding model for training to obtain the pre-trained coding model. In some embodiments, the data form of the sample motion characteristics includes but is not limited to a matrix.
In one embodiment, when extracting sample action features from sample dance actions of each sample dance music data, the computer device may obtain the number of key points and the positions of the key points of a human body corresponding to each sample dance action of each sample dance music data, where the positions of the key points include coordinates of each key point; and inputting the number of key points and the positions of the key points of the human body corresponding to each sample dance motion into the initial coding model, and extracting to obtain sample motion characteristics. In some embodiments, the key points of the human body of each sample dance action may include 24 key points of the human body, and in some embodiments, the key point positions of each sample dance action may include three-dimensional coordinate data of each sample dance action.
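As a sketch of the encoder input described here, packing the key point counts and positions of the sample dance motions into an array might look as follows; the (T, 24, 3) layout mirrors the example in the text, and the frame format {keypoint_id: (x, y, z)} is an assumption.

```python
import numpy as np

def build_sample_motion_input(keypoint_frames):
    """Pack the human-body key point positions of each sample dance
    motion into the coding model's input array (sketch only)."""
    arr = np.zeros((len(keypoint_frames), 24, 3), dtype=np.float32)
    for t, frame in enumerate(keypoint_frames):
        for k, (x, y, z) in frame.items():   # 24 keypoints, 3-D coordinates
            arr[t, k] = (x, y, z)
    return arr
```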
In one embodiment, when training the initial coding model with the sample motion characteristics and the sample audio to obtain the pre-trained coding model, the computer device may input the sample motion characteristics and the sample audio into the initial coding model to obtain first sample motion characteristics, and determine second sample motion characteristics similar to the first sample motion characteristics from the motion characteristics of the plurality of dance motions stored in the motion library in advance; input the second sample motion characteristics into an initial decoding model to obtain third sample motion characteristics; and adjust the model parameters of the initial decoding model according to the third sample motion characteristics, and input the second sample motion characteristics into the adjusted decoding model for training to obtain a pre-trained decoding model. In some embodiments, the third sample motion characteristics are used to indicate the number of key points of the human body in the sample audio and the position of each key point.
In one embodiment, when the computer device adjusts the model parameters of the initial decoding model according to the motion characteristics of the third sample, and inputs the motion characteristics of the second sample into the adjusted decoding model for training to obtain a pre-trained decoding model, the computer device may determine the dance motion of the sample audio according to the motion characteristics of the third sample; comparing the determined dance motion of the sample audio with the sample dance motion of the sample audio, and adjusting the model parameters of the initial decoding model according to the comparison result; and inputting the motion characteristics of the second sample into the decoding model after the model parameters are adjusted, and retraining to obtain a pre-trained decoding model.
When determining the dance motion of the sample audio according to the third sample motion characteristic, the computer device may determine the number of key points of the human body and the position of each key point in the sample audio according to the third sample motion characteristic; and determining the dance motion of the sample audio according to the number of the key points of the human body in the sample audio and the position of each key point.
Further, the computer device may compare the motion characteristic matrix of the dance motion of the sample audio with the motion characteristic of the sample dance motion when comparing the determined dance motion of the sample audio with the sample dance motion of the sample audio.
Further, when adjusting the model parameters of the initial coding model according to the comparison result, the computer device may calculate the vector distances between the vectors in the motion characteristics of the determined dance motion of the sample audio and the corresponding vectors in the motion characteristics of the sample dance motion, and when a vector distance is greater than a first distance threshold, adjust the model parameters of the initial coding model according to that vector distance.
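The adjustment rule described here (update only when a vector distance exceeds the first distance threshold) could be sketched as one PyTorch training step; the squared-distance loss is an assumption, since the text only states that the parameters are adjusted according to the vector distances.

```python
import torch

def adjust_model_step(model, optimizer, sample_input, sample_feats, threshold=0.0):
    """One illustrative parameter-adjustment step (sketch only)."""
    pred_feats = model(sample_input)                  # generated motion features
    dists = torch.norm(pred_feats - sample_feats, dim=-1)
    over = dists > threshold                          # only over-threshold vectors
    if over.any():
        loss = (dists[over] ** 2).mean()              # assumed squared-distance loss
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```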
Through the initial coding model, the mapping relation between each sample dance motion and its sample motion characteristics can be obtained; by training the initial coding model according to the sample motion characteristics and the sample audio, the pre-trained coding model can be obtained, together with the mapping relations among the sample motion characteristics, the sample audio, and the motion characteristic matrices of the sample dance motions.
S404: the plurality of audio segments are input into a pre-trained coding model to obtain a first action characteristic of each of the plurality of audio segments.
S405: and determining a second action characteristic similar to the first action characteristic of each audio segment from the action characteristics of a plurality of dance actions stored in the action library in advance according to the first action characteristic of each audio segment.
S406: and inputting the second motion characteristic of each audio segment into a pre-trained decoding model to obtain a third motion characteristic, and determining the dance motion of the audio to be danced according to the third motion characteristic.
According to the embodiment of the application, the mapping relation between each sample dance action and the sample action characteristic is obtained through the initial coding model, the pre-trained coding model is obtained by training the initial coding model according to the sample action characteristic and the sample audio, and the mapping relation among the sample action characteristic, the sample audio and the first sample action characteristic is obtained, so that the first action characteristic of the audio to be coded is generated through the pre-trained coding model during testing.
Referring to fig. 5, fig. 5 is a schematic flow chart of another dance motion generation method provided in the embodiment of the present application, and the dance motion generation method in the embodiment of the present application may be executed by a dance motion generation apparatus, where the dance motion generation apparatus is disposed in a terminal or a computer device, and a specific explanation of the terminal or the computer device is as described above. Specifically, the embodiment of the present application mainly describes a training process of a decoding model, and specifically includes the following steps.
S501: obtaining audio to be danced, and extracting a plurality of audio segments from the audio to be danced.
S502: and inputting the plurality of audio segments into a pre-trained coding model to obtain a first action characteristic of each audio segment in the plurality of audio segments.
S503: and determining a second action characteristic similar to the first action characteristic of each audio segment from the action characteristics of a plurality of dance actions stored in the action library in advance according to the first action characteristic of each audio segment.
S504: and inputting the sample audio into a pre-trained coding model to obtain a first sample action characteristic corresponding to the sample audio.
S505: and training the initial decoding model according to the first sample action characteristic to obtain the pre-trained decoding model.
In one embodiment, when the computer device trains the initial decoding model according to the first sample action characteristics to obtain a pre-trained decoding model, the computer device may determine, according to the first sample action characteristics, second sample action characteristics similar to the first sample action characteristics from action characteristics of a plurality of dance actions stored in an action library in advance; and inputting the second sample motion characteristics into the initial decoding model for training to obtain a pre-trained decoding model.
In one embodiment, when the computer device inputs the second sample motion characteristics into the initial decoding model for training to obtain the pre-trained decoding model, the computer device may input the second sample motion characteristics into the initial decoding model to obtain third sample motion characteristics; and adjusting the model parameters of the initial decoding model according to the third sample action characteristics, and inputting the second sample action characteristics into the adjusted decoding model for training to obtain the pre-trained decoding model.
In one embodiment, the computer device may determine a dance motion of the sample audio based on the third sample motion characteristic; comparing the determined dance motion of the sample audio with the sample dance motion of the sample audio, and adjusting the model parameters of the initial decoding model according to the comparison result; and inputting the motion characteristics of the second sample into the decoding model after the model parameters are adjusted, and retraining to obtain a pre-trained decoding model.
When the computer equipment determines the dance movement of the sample audio according to the third sample movement characteristic, the number of key points of the human body in the sample audio and the position of each key point can be determined according to the third sample movement characteristic; and determining the dance motion of the sample audio according to the number of the key points of the human body in the sample audio and the position of each key point.
In one embodiment, the computer device may adjust the model parameters of the initial decoding model based on the third sample motion characteristics and the sample motion characteristics when adjusting the model parameters based on the third sample motion characteristics.
Further, the computer device may adjust the model parameters of the initial decoding model according to a vector distance between a vector in the third sample motion feature matrix and a vector in the sample motion feature matrix when adjusting the model parameters of the initial decoding model according to the third sample motion feature and the sample motion feature.
Further, when a vector distance between the vector in the third sample motion feature and the vector in the sample motion feature is greater than a second distance threshold, the computer device may adjust the model parameters of the initial decoding model according to the vector distance between the vector in the third sample motion feature and the vector in the sample motion feature.
S506: and inputting the second action characteristic corresponding to each audio segment into a pre-trained decoding model to obtain a third action characteristic, and determining the dance action of the audio to be danced according to the third action characteristic.
According to the embodiment of the application, the first sample action characteristics of the sample audio are generated by the pre-trained coding model, and the initial decoding model is trained according to the first sample action characteristics to obtain the pre-trained decoding model, so that the third action characteristics of the audio to be danced can be generated more accurately during testing, thereby generating more accurate and higher-quality dance actions.
Referring to fig. 6, fig. 6 is a flowchart of yet another dance motion generation method provided in this embodiment of the present application: the audio 61 to be danced is obtained, and a plurality of audio segments 62 are extracted from it; the plurality of audio segments are input into the pre-trained coding model to obtain a first motion characteristic 63 of each audio segment; a second motion characteristic similar to the first motion characteristic of each audio segment is determined from the motion characteristics of the plurality of dance motions stored in the motion library in advance; the second motion characteristic of each audio segment is input into the pre-trained decoding model to obtain third motion characteristics 64 of all the audio segments; and the dance motion of the audio to be danced is determined according to the third motion characteristics.
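Tying the sketches above together, the fig. 6 pipeline could be read as the following illustrative sketch; featurize stands in for whatever audio feature extraction the coding model expects and, like the other helper names, is hypothetical.

```python
import numpy as np
import torch

def featurize(segments, n_features=128):
    # Stand-in for the (unspecified) audio feature extraction the
    # coding model expects; here, a simple fixed-length summary.
    feats = [np.resize(seg, n_features) for seg in segments]
    return torch.tensor(np.stack(feats), dtype=torch.float32)

def generate_dance(audio_path, encoder, decoder, library):
    """Illustrative end-to-end pipeline for fig. 6 (all names hypothetical)."""
    segments = extract_segments(audio_path)                    # audio segments 62
    first = encode_segments(encoder, featurize(segments))      # first features 63
    second = nearest_motion_features(first.numpy(), library)   # action-library lookup
    with torch.no_grad():
        third = decoder(torch.tensor(second, dtype=torch.float32))  # third features 64
    return keypoints_from_third_features(third.numpy().reshape(-1, 24, 3))
```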
Referring to fig. 7, fig. 7 is a schematic structural diagram of a dance motion generating apparatus according to an embodiment of the present application. Specifically, the dance motion generating apparatus is provided in a computer device, and includes: an acquisition unit 701, an encoding unit 702, a determination unit 703, and a decoding unit 704;
an acquisition unit 701, configured to acquire audio to be danced and extract a plurality of audio segments from the audio to be danced;
an encoding unit 702, configured to input the multiple audio segments into a pre-trained encoding model, to obtain a first motion characteristic of each of the multiple audio segments, where the encoding model is obtained by training a sample audio and a sample dance motion corresponding to the sample audio;
a determining unit 703, configured to determine, according to the first motion feature of each audio segment, a second motion feature similar to the first motion feature of each audio segment from among motion features of multiple dance motions stored in a motion library in advance;
a decoding unit 704, configured to input the second motion characteristic corresponding to each audio segment into a pre-trained decoding model, obtain a third motion characteristic, and determine the dance motion of the audio to be danced according to the third motion characteristic, where the decoding model is obtained by training the sample motion characteristics corresponding to the sample audio and the sample dance motions corresponding to the sample audio, the sample motion characteristics corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained coding model, and the third motion characteristic is used to indicate the motion characteristics of all audio segments.
Further, when the acquisition unit 701 extracts a plurality of audio segments from the audio to be danced, it is specifically configured to:
acquiring beat information of the audio to be danced;
and extracting a plurality of audio clips from the audio to be danced according to the beat information.
Further, the first motion characteristic is specifically a first motion characteristic matrix; when the determining unit 703 determines, according to the first motion feature of each audio segment, a second motion feature similar to the first motion feature of each audio segment from the motion features of multiple dance motions stored in the motion library in advance, the determining unit is specifically configured to:
for each audio clip, obtaining each row vector in a first action characteristic matrix corresponding to the audio clip to obtain a plurality of first action characteristic vectors corresponding to the audio clip, wherein each row vector is used for indicating an action;
for each first action characteristic vector of each audio clip, determining a second action characteristic vector corresponding to the first action characteristic vector from action characteristics of various dance actions stored in the action library in advance, and obtaining a plurality of second action characteristic vectors corresponding to each audio clip;
and for each audio clip, combining a plurality of second motion characteristic vectors corresponding to the audio clip to obtain a second motion characteristic matrix corresponding to the audio clip, wherein the second motion characteristic matrix is used for representing the second motion characteristics.
Further, when the determining unit 703 determines the second motion feature vector corresponding to the first motion feature vector from the motion features of the multiple dance motions stored in the motion library in advance, the determining unit is specifically configured to:
obtaining the distance between the first action characteristic vector and each action characteristic vector in the action library, wherein each action characteristic vector in the action library is used for indicating a pre-stored dance action;
and acquiring the motion characteristic vector with the shortest distance to the first motion characteristic vector from the motion library, and taking the motion characteristic vector as a second motion characteristic vector corresponding to the first motion characteristic vector.
Further, before the encoding unit 702 inputs the plurality of audio segments into the pre-trained encoding model and obtains the first motion feature matrix associated with each of the plurality of audio segments, it is further configured to:
acquiring a sample data set, wherein the sample data set comprises a plurality of sample dance music data, and each sample dance music data comprises sample audio and sample dance actions;
extracting sample action characteristics from the sample dance actions of each sample dance music data;
and inputting the sample action characteristics and the sample audio into an initial coding model for training to obtain the pre-trained coding model.
Further, the encoding unit 702 is configured to, when extracting sample dance motion features from the sample dance motions of each sample dance music data:
acquiring the number of key points and the positions of the key points of the human body corresponding to each sample dance action;
and inputting the number of key points and the positions of the key points of the human body corresponding to each sample dance action into the initial coding model, and extracting the sample action characteristics.
Further, before the decoding unit 704 inputs the second motion feature corresponding to each audio segment into the pre-trained decoding model to obtain the third motion feature, the decoding unit is further configured to:
inputting the sample audio into the pre-trained coding model to obtain a first sample action characteristic of the sample audio; determining a second sample action characteristic similar to the first sample action characteristic from an action library according to the first sample action characteristic;
inputting the second sample action characteristic into an initial decoding model to obtain a third sample action characteristic; determining dance motions of the sample audio according to the third sample motion characteristics;
and comparing the determined dance motion of the sample audio with the sample dance motion of the sample audio, and adjusting the model parameters of the initial decoding model according to the comparison result to obtain the pre-trained decoding model.
Further, when the decoding unit 704 determines, according to the third motion characteristic, the dance motion of the audio to be danced, specifically, the decoding unit is configured to:
determining the number of key points of a human body in the audio to be danced and the position of each key point according to the third action characteristics;
and determining the dance motion of the audio to be danced according to the number of the key points of the human body in the audio to be danced and the position of each key point.
According to the embodiment of the application, the audio segments of the audio to be danced are encoded by the pre-trained coding model to obtain the first action characteristics of each audio segment, second action characteristics similar to the first action characteristics are determined from the action characteristics of the dance actions stored in the action library in advance, and the second action characteristics are decoded by the pre-trained decoding model to generate dance actions of the audio to be danced with higher accuracy and quality.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. Specifically, the computer device includes: memory 801, processor 802.
In one embodiment, the computer device further comprises a data interface 803, the data interface 803 being used for transferring data information between the computer device and other devices.
The memory 801 may include a volatile memory (volatile memory); the memory 801 may also include a non-volatile memory (non-volatile memory); the memory 801 may also comprise a combination of memories of the kind described above. The processor 802 may be a Central Processing Unit (CPU). The processor 802 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), or any combination thereof.
The memory 801 is used for storing programs, and the processor 802 can call the programs stored in the memory 801 for executing the following steps:
acquiring audio to be danced, and extracting a plurality of audio segments from the audio to be danced;
inputting the plurality of audio segments into a pre-trained coding model to obtain a first action characteristic of each audio segment in the plurality of audio segments, wherein the coding model is obtained by training a sample audio and a sample dance action corresponding to the sample audio;
according to the first action characteristic of each audio clip, determining a second action characteristic similar to the first action characteristic of each audio clip from action characteristics of a plurality of dance actions stored in an action library in advance;
inputting the second motion characteristics corresponding to each audio clip into a pre-trained decoding model to obtain third motion characteristics, and determining dance motions of the audio to be danced according to the third motion characteristics, wherein the decoding model is obtained by training sample motion characteristics corresponding to the sample audio and the sample dance motions corresponding to the sample audio, the sample motion characteristics corresponding to the sample audio are obtained by encoding the sample audio by using the pre-trained encoding model, and the third motion characteristics are used for indicating the motion characteristics of all the audio clips.
Further, when extracting the plurality of audio segments from the audio to be danced, the processor 802 is specifically configured to:
acquire beat information of the audio to be danced; and
extract the plurality of audio segments from the audio to be danced according to the beat information.
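As one plausible realization, the beat information could come from an off-the-shelf beat tracker; the sketch below assumes the librosa library, which the patent does not mandate.

```python
import librosa

# Split the audio on detected beats; librosa is an assumed choice of
# beat tracker, not one prescribed by the patent.
def split_on_beats(path):
    y, sr = librosa.load(path, sr=None)  # load waveform and sample rate
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_samples = librosa.frames_to_samples(beat_frames)
    # Each audio segment spans the interval between two consecutive beats.
    return [y[start:end] for start, end in zip(beat_samples[:-1], beat_samples[1:])]
```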
Further, the first motion feature is specifically a first motion feature matrix; when determining, according to the first motion feature of each audio segment, a second motion feature similar to the first motion feature of each audio segment from among the motion features of the plurality of dance motions stored in advance in the motion library, the processor 802 is specifically configured to:
for each audio segment, obtain each row vector in the first motion feature matrix corresponding to the audio segment to obtain a plurality of first motion feature vectors corresponding to the audio segment, wherein each row vector indicates one motion;
for each first motion feature vector of each audio segment, determine a second motion feature vector corresponding to the first motion feature vector from among the motion features of the plurality of dance motions stored in advance in the motion library, to obtain a plurality of second motion feature vectors corresponding to each audio segment; and
for each audio segment, combine the plurality of second motion feature vectors corresponding to the audio segment to obtain a second motion feature matrix corresponding to the audio segment, wherein the second motion feature matrix represents the second motion feature.
Further, when determining the second motion feature vector corresponding to the first motion feature vector from among the motion features of the dance motions stored in advance in the motion library, the processor 802 is specifically configured to:
obtain the distance between the first motion feature vector and each motion feature vector in the motion library, wherein each motion feature vector in the motion library indicates one pre-stored dance motion; and
obtain, from the motion library, the motion feature vector with the shortest distance to the first motion feature vector as the second motion feature vector corresponding to the first motion feature vector.
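A compact sketch of this lookup follows; Euclidean distance is an assumption here, as the patent speaks only of "distance" without fixing the metric.

```python
import numpy as np

# Row-wise retrieval: each row of the first motion feature matrix is
# replaced by the closest vector in the motion library (Euclidean
# distance assumed), yielding the second motion feature matrix.
def retrieve_second_features(first_feat_matrix, motion_library):
    """first_feat_matrix: (rows, d); motion_library: (num_motions, d)."""
    dists = np.linalg.norm(
        first_feat_matrix[:, None, :] - motion_library[None, :, :], axis=-1
    )                                   # (rows, num_motions) pairwise distances
    nearest = np.argmin(dists, axis=1)  # index of the closest library vector per row
    return motion_library[nearest]      # (rows, d) second motion feature matrix
```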
Further, before inputting the plurality of audio segments into the pre-trained coding model to obtain the first motion feature matrix of each of the plurality of audio segments, the processor 802 is further configured to:
acquire a sample data set, wherein the sample data set comprises a plurality of pieces of sample dance music data, and each piece of sample dance music data comprises a sample audio and a sample dance motion;
extract sample motion features from the sample dance motion of each piece of sample dance music data; and
input the sample motion features and the sample audio into an initial coding model for training to obtain the pre-trained coding model.
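A hedged sketch of such training is given below, using PyTorch and a mean-squared-error loss; the architecture, loss, and tensor shapes are all assumptions, since the patent leaves the coding model unspecified.

```python
import torch
import torch.nn as nn

# Illustrative coding-model training: the encoder maps audio features to
# motion features, supervised by the sample motion features. Optimizer,
# loss, and shapes are assumptions; the patent does not fix them.
def train_encoder(encoder, dataset, epochs=10, lr=1e-4):
    """dataset yields (sample_audio_feat, sample_motion_feat) tensor pairs."""
    optimizer = torch.optim.Adam(encoder.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for audio_feat, motion_feat in dataset:
            pred = encoder(audio_feat)         # predicted motion features
            loss = loss_fn(pred, motion_feat)  # compare with sample motion features
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return encoder
```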
Further, when extracting the sample motion features from the sample dance motion of each piece of sample dance music data, the processor 802 is specifically configured to:
acquire the number and positions of the human-body key points corresponding to each sample dance motion; and
input the number and positions of the human-body key points corresponding to each sample dance motion into the initial coding model to extract the sample motion features.
Further, before inputting the second motion feature corresponding to each audio segment into the pre-trained decoding model to obtain the third motion feature, the processor 802 is further configured to:
input the sample audio into the pre-trained coding model to obtain a first sample motion feature of the sample audio, and determine, according to the first sample motion feature, a second sample motion feature similar to the first sample motion feature from the motion library;
input the second sample motion feature into an initial decoding model to obtain a third sample motion feature, and determine the dance motion of the sample audio according to the third sample motion feature; and
compare the determined dance motion of the sample audio with the sample dance motion of the sample audio, and adjust the model parameters of the initial decoding model according to the comparison result to obtain the pre-trained decoding model.
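The decoder training described above might look as follows; `retrieve` stands in for the motion-library lookup, the encoder is frozen because it is already pre-trained, and the MSE loss is again an assumption.

```python
import torch
import torch.nn as nn

# Sketch of decoding-model training: the frozen pre-trained encoder and the
# motion-library lookup produce the second sample motion features; the
# trainable decoder reconstructs the dance motion and its parameters are
# adjusted against the sample dance motion. `retrieve` is a hypothetical
# stand-in for the nearest-neighbour lookup.
def train_decoder(encoder, decoder, retrieve, dataset, epochs=10, lr=1e-4):
    """dataset yields (sample_audio_feat, sample_dance_motion) tensor pairs."""
    optimizer = torch.optim.Adam(decoder.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for audio_feat, dance_motion in dataset:
            with torch.no_grad():                   # encoder stays fixed
                first_feat = encoder(audio_feat)    # first sample motion feature
                second_feat = retrieve(first_feat)  # second sample motion feature
            pred_motion = decoder(second_feat)      # third sample motion feature
            loss = loss_fn(pred_motion, dance_motion)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return decoder
```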
Further, when determining the dance motion of the audio to be danced according to the third motion feature, the processor 802 is specifically configured to:
determine, according to the third motion feature, the number of human-body key points in the audio to be danced and the position of each key point; and
determine the dance motion of the audio to be danced according to the number of human-body key points and the position of each key point.
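Illustratively, if the third motion feature carries one flattened skeleton per frame, the mapping to key points could be a simple reshape; the (frames, num_keypoints × 3) layout and 3-D coordinates below are assumptions, as the patent states only that the number and positions of key points are recovered.

```python
import numpy as np

# Hypothetical decoding of the third motion feature into per-frame key
# points; the flattened (frames, num_keypoints * 3) layout is an assumption.
def features_to_keypoints(third_feat, num_keypoints):
    """third_feat: (frames, num_keypoints * 3) array of decoded features."""
    frames = third_feat.shape[0]
    poses = third_feat.reshape(frames, num_keypoints, 3)  # (x, y, z) per key point
    return poses  # one skeleton pose per frame; the sequence is the dance motion
```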
According to the dance action generating method and device of the present application, the pre-trained coding model encodes the audio segments of the audio to be danced to obtain the first motion feature of each audio segment, a second motion feature similar to the first motion feature of each audio segment is retrieved from the motion features of the plurality of dance motions stored in advance in the motion library, and the second motion features of the audio segments are decoded by the pre-trained decoding model, so that the dance motion generated for the audio to be danced has higher accuracy and higher quality.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program. When the computer program is executed by a processor, it implements the method described in the embodiments corresponding to fig. 1, fig. 4, or fig. 5 of the present application, and may also implement the apparatus of the embodiment corresponding to fig. 7, which is not described herein again.
The computer-readable storage medium may be an internal storage unit of the device of any of the foregoing embodiments, for example, a hard disk or a memory of the device. The computer-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the terminal, and may also be used to temporarily store data that has been output or is to be output.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when executed, can include the processes of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
While only some embodiments have been described in detail herein, it will be understood that all modifications and equivalents may be resorted to, falling within the scope of the invention.
Claims (10)
1. A dance action generating method, comprising:
acquiring audio to be danced, and extracting a plurality of audio segments from the audio to be danced;
inputting the plurality of audio segments into a pre-trained coding model to obtain a first motion feature of each of the plurality of audio segments, wherein the coding model is obtained by training on a sample audio and a sample dance motion corresponding to the sample audio;
determining, according to the first motion feature of each audio segment, a second motion feature similar to the first motion feature of each audio segment from among motion features of a plurality of dance motions stored in advance in a motion library; and
inputting the second motion feature corresponding to each audio segment into a pre-trained decoding model to obtain a third motion feature, and determining a dance motion of the audio to be danced according to the third motion feature, wherein the decoding model is obtained by training on sample motion features corresponding to the sample audio and the sample dance motion corresponding to the sample audio, the sample motion features are obtained by encoding the sample audio with the pre-trained coding model, and the third motion feature indicates the motion features of all the audio segments.
2. The method of claim 1, wherein the extracting a plurality of audio segments from the audio to be danced comprises:
acquiring beat information of the audio to be danced; and
extracting the plurality of audio segments from the audio to be danced according to the beat information.
3. The method of claim 1, wherein the first motion feature is specifically a first motion feature matrix, and the determining, according to the first motion feature of each audio segment, a second motion feature similar to the first motion feature of each audio segment from among the motion features of the plurality of dance motions stored in advance in the motion library comprises:
for each audio segment, obtaining each row vector in the first motion feature matrix corresponding to the audio segment to obtain a plurality of first motion feature vectors corresponding to the audio segment, wherein each row vector indicates one motion;
for each first motion feature vector of each audio segment, determining a second motion feature vector corresponding to the first motion feature vector from among the motion features of the plurality of dance motions stored in advance in the motion library, to obtain a plurality of second motion feature vectors corresponding to each audio segment; and
for each audio segment, combining the plurality of second motion feature vectors corresponding to the audio segment to obtain a second motion feature matrix corresponding to the audio segment, wherein the second motion feature matrix represents the second motion feature.
4. The method of claim 3, wherein the determining a second motion feature vector corresponding to the first motion feature vector from among the motion features of the plurality of dance motions stored in advance in the motion library comprises:
obtaining the distance between the first motion feature vector and each motion feature vector in the motion library, wherein each motion feature vector in the motion library indicates one pre-stored dance motion; and
obtaining, from the motion library, the motion feature vector with the shortest distance to the first motion feature vector as the second motion feature vector corresponding to the first motion feature vector.
5. The method of claim 1, wherein before the inputting the plurality of audio segments into a pre-trained coding model to obtain a first motion feature matrix of each of the plurality of audio segments, the method further comprises:
acquiring a sample data set, wherein the sample data set comprises a plurality of pieces of sample dance music data, and each piece of sample dance music data comprises a sample audio and a sample dance motion;
extracting sample motion features from the sample dance motion of each piece of sample dance music data; and
inputting the sample motion features and the sample audio into an initial coding model for training to obtain the pre-trained coding model.
6. The method of claim 5, wherein the extracting sample motion features from the sample dance motion of each piece of sample dance music data comprises:
acquiring the number and positions of the human-body key points corresponding to each sample dance motion; and
inputting the number and positions of the human-body key points corresponding to each sample dance motion into the initial coding model to extract the sample motion features.
7. The method of claim 1, wherein before the inputting the second motion feature corresponding to each audio segment into a pre-trained decoding model to obtain a third motion feature, the method further comprises:
inputting the sample audio into the pre-trained coding model to obtain a first sample motion feature of the sample audio, and determining, according to the first sample motion feature, a second sample motion feature similar to the first sample motion feature from the motion library;
inputting the second sample motion feature into an initial decoding model to obtain a third sample motion feature, and determining a dance motion of the sample audio according to the third sample motion feature; and
comparing the determined dance motion of the sample audio with the sample dance motion of the sample audio, and adjusting the model parameters of the initial decoding model according to the comparison result to obtain the pre-trained decoding model.
8. The method of claim 1, wherein the determining a dance motion of the audio to be danced according to the third motion feature comprises:
determining, according to the third motion feature, the number of human-body key points in the audio to be danced and the position of each key point; and
determining the dance motion of the audio to be danced according to the number of human-body key points and the position of each key point.
9. A computer device, comprising a processor, an input device, an output device, and a memory, which are interconnected, wherein the memory is configured to store a computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the method according to any one of claims 1-8.
10. A computer-readable storage medium storing program instructions which, when executed, implement the method according to any one of claims 1-8.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211441749.0A CN115712739B (en) | 2022-11-17 | 2022-11-17 | Dance motion generation method, computer device and storage medium |
PCT/CN2023/090889 WO2024103637A1 (en) | 2022-11-17 | 2023-04-26 | Dance movement generation method, computer device, and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211441749.0A CN115712739B (en) | 2022-11-17 | 2022-11-17 | Dance motion generation method, computer device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115712739A | 2023-02-24 |
CN115712739B | 2024-03-26 |
Family
ID=85233642
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211441749.0A | Dance motion generation method, computer device and storage medium | 2022-11-17 | 2022-11-17 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN115712739B (en) |
WO (1) | WO2024103637A1 (en) |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101615302B (en) * | 2009-07-30 | 2011-09-07 | 浙江大学 | Dance action production method driven by music data and based on machine learning |
US10825221B1 (en) * | 2019-04-23 | 2020-11-03 | Adobe Inc. | Music driven human dancing video synthesis |
CN114756706A (en) * | 2022-04-06 | 2022-07-15 | 北京达佳互联信息技术有限公司 | Resource synthesis method, device, equipment and storage medium |
CN115712739B (en) * | 2022-11-17 | 2024-03-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Dance motion generation method, computer device and storage medium |
Patent Citations (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020117046A1 (en) * | 2001-02-28 | 2002-08-29 | Loo Siang L. | Dance visualization of music |
JP2016041142A (en) * | 2014-08-15 | 2016-03-31 | 国立研究開発法人産業技術総合研究所 | Dance motion data creation system and dance motion data creation method |
KR20170032146A (en) * | 2015-09-14 | 2017-03-22 | 한국전자통신연구원 | Apparatus and method for designing choreography |
CN110955786A (en) * | 2019-11-29 | 2020-04-03 | 网易(杭州)网络有限公司 | Dance action data generation method and device |
CN110992449A (en) * | 2019-11-29 | 2020-04-10 | 网易(杭州)网络有限公司 | Dance action synthesis method, device, equipment and storage medium |
KR20210120636A (en) * | 2020-03-27 | 2021-10-07 | 주식회사 안무공장 | Apparatus for providing supplementary services based on choreography contents |
KR102192210B1 (en) * | 2020-06-23 | 2020-12-16 | 인하대학교 산학협력단 | Method and Apparatus for Generation of LSTM-based Dance Motion |
CN111711868A (en) * | 2020-06-24 | 2020-09-25 | 中国科学院自动化研究所 | Dance generation method, system and device based on audio-visual multi-mode |
CN111970536A (en) * | 2020-07-24 | 2020-11-20 | 北京航空航天大学 | Method and device for generating video based on audio |
CN113160848A (en) * | 2021-05-07 | 2021-07-23 | 网易(杭州)网络有限公司 | Dance animation generation method, dance animation model training method, dance animation generation device, dance animation model training device, dance animation equipment and storage medium |
CN113856203A (en) * | 2021-10-28 | 2021-12-31 | 广州艾美网络科技有限公司 | Dance editing method and device and dancing machine |
CN114419205A (en) * | 2021-12-22 | 2022-04-29 | 北京百度网讯科技有限公司 | Driving method of virtual digital human and training method of pose acquisition model |
CN114401439A (en) * | 2022-02-10 | 2022-04-26 | 腾讯音乐娱乐科技(深圳)有限公司 | Dance video generation method, equipment and storage medium |
CN114758636A (en) * | 2022-03-14 | 2022-07-15 | 深圳市航天泰瑞捷电子有限公司 | Dance music generation method, device, terminal and readable storage medium |
Non-Patent Citations (3)
Title |
---|
庞志娟: "Optimization Simulation of Matching Dance Technical Movements with Music", Computer Simulation (计算机仿真), no. 06, 15 June 2017, pages 355-358 *
方丹芳, 李学明, 柳杨, 李荣锋: "Music-Driven Dance Motion Synthesis Based on Transition-Frame Interpolation", Journal of Fudan University (Natural Science) (复旦学报(自然科学版)), no. 03, 15 June 2018, pages 92-99 *
李丁辛: "Design of a Retrieval System for Specific Action Segments in Highly Dynamic Dance Videos", Modern Electronics Technique (现代电子技术), no. 05, 27 February 2018, pages 105-109 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2024103637A1 (en) * | 2022-11-17 | 2024-05-23 | 腾讯音乐娱乐科技(深圳)有限公司 | Dance movement generation method, computer device, and storage medium |
Also Published As
Publication number | Publication date |
---|---|
WO2024103637A1 (en) | 2024-05-23 |
CN115712739B (en) | 2024-03-26 |
WO2024103637A9 (en) | 2024-07-11 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR102071582B1 (en) | Method and apparatus for classifying a class to which a sentence belongs by using deep neural network | |
CN111164601B (en) | Emotion recognition method, intelligent device and computer readable storage medium | |
CN110349572B (en) | Voice keyword recognition method and device, terminal and server | |
CN110288665B (en) | Image description method based on convolutional neural network, computer-readable storage medium and electronic device | |
CN117521675A (en) | Information processing method, device, equipment and storage medium based on large language model | |
CN111598979B (en) | Method, device and equipment for generating facial animation of virtual character and storage medium | |
CN112599117B (en) | Model training method, model training device, voice recognition method, voice recognition device, electronic equipment and storage medium | |
CN112233698A (en) | Character emotion recognition method and device, terminal device and storage medium | |
CN110795549B (en) | Short text conversation method, device, equipment and storage medium | |
CN111859954A (en) | Target object identification method, device, equipment and computer readable storage medium | |
CN112487139A (en) | Text-based automatic question setting method and device and computer equipment | |
US20230034414A1 (en) | Dialogue processing apparatus, learning apparatus, dialogue processing method, learning method and program | |
CN112837669B (en) | Speech synthesis method, device and server | |
CN111046674B (en) | Semantic understanding method and device, electronic equipment and storage medium | |
CN113178200B (en) | Voice conversion method, device, server and storage medium | |
CN112804558B (en) | Video splitting method, device and equipment | |
CN112463942A (en) | Text processing method and device, electronic equipment and computer readable storage medium | |
CN108959388A (en) | information generating method and device | |
CN111653274A (en) | Method, device and storage medium for awakening word recognition | |
CN114360502A (en) | Processing method of voice recognition model, voice recognition method and device | |
CN111368531A (en) | Translation text processing method and device, computer equipment and storage medium | |
CN115424013A (en) | Model training method, image processing apparatus, and medium | |
CN115712739B (en) | Dance motion generation method, computer device and storage medium | |
CN109961152B (en) | Personalized interaction method and system of virtual idol, terminal equipment and storage medium | |
CN114373443A (en) | Speech synthesis method and apparatus, computing device, storage medium, and program product |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||