WO2024103637A1 - Dance movement generation method, computer device and storage medium - Google Patents

Dance movement generation method, computer device and storage medium

Info

Publication number
WO2024103637A1
Authority
WO
WIPO (PCT)
Prior art keywords
audio
action
sample
dance
feature
Prior art date
Application number
PCT/CN2023/090889
Other languages
English (en)
French (fr)
Other versions
WO2024103637A9 (zh)
Inventor
何艾莲
林开来
张悦
黄均昕
董治
姜涛
Original Assignee
腾讯音乐娱乐科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 腾讯音乐娱乐科技(深圳)有限公司
Publication of WO2024103637A1 publication Critical patent/WO2024103637A1/zh
Publication of WO2024103637A9 publication Critical patent/WO2024103637A9/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/40: Information retrieval; Database structures therefor; File system structures therefor of multimedia data, e.g. slideshows comprising image and additional audio data
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Definitions

  • the present application relates to the field of artificial intelligence, and in particular to a dance movement generation method, a computer device and a storage medium.
  • the embodiments of the present application provide a dance movement generation method, a computer device, and a storage medium, which can automatically generate dance movements and improve the quality of the generated dance movements.
  • an embodiment of the present application provides a dance movement generation method, comprising: acquiring audio to be choreographed, and extracting multiple audio clips from the audio to be choreographed; inputting the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips, wherein the encoding model is trained by sample audio and sample dance movements corresponding to the sample audio; and determining, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from action features of multiple dance actions pre-stored in an action library;
  • the second action feature corresponding to each audio clip is input into a pre-trained decoding model to obtain a third action feature, and the dance movements of the audio to be choreographed are determined according to the third action feature, wherein the decoding model is trained by the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained encoding model, and the third action feature is configured to indicate the action features of all audio clips.
  • an embodiment of the present application provides a dance movement generating device, comprising:
  • an acquisition unit configured to acquire the audio to be choreographed, and extract a plurality of audio segments from the audio to be choreographed;
  • an encoding unit configured to input the plurality of audio clips into a pre-trained encoding model to obtain a first motion feature of each of the plurality of audio clips, wherein the encoding model is trained by sample audio and sample dance motions corresponding to the sample audio;
  • a determining unit configured to determine, according to the first action feature of each audio segment, a second action feature similar to the first action feature of each audio segment from action features of a plurality of dance actions pre-stored in an action library;
  • the decoding unit is configured to input the second action feature corresponding to each audio clip into a pre-trained decoding model to obtain a third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature, wherein the decoding model is trained by the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained encoding model, and the third action feature is configured to indicate the action features of all audio clips.
  • an embodiment of the present application provides a computer device, the computer device comprising: a processor and a memory, the processor being configured to execute: acquiring audio to be choreographed, and extracting multiple audio clips from the audio to be choreographed; inputting the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips, wherein the encoding model is trained by sample audio and sample dance movements corresponding to the sample audio; and determining, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from action features of multiple dance actions pre-stored in an action library;
  • the second action feature corresponding to each audio clip is input into a pre-trained decoding model to obtain a third action feature, and the dance movements of the audio to be choreographed are determined according to the third action feature, wherein the decoding model is trained by the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained encoding model, and the third action feature is configured to indicate the action features of all audio clips.
  • an embodiment of the present application further provides a computer-readable storage medium, in which program instructions are stored, and when the program instructions are executed, they are configured to implement the method described in the first aspect above.
  • the embodiment of the present application can obtain the audio to be choreographed, and extract multiple audio clips from the audio to be choreographed; input the multiple audio clips into a pre-trained encoding model to obtain the first action feature of each audio clip in the multiple audio clips, wherein the encoding model is trained by the sample audio and the sample dance movements corresponding to the sample audio; according to the first action feature of each audio clip, determine the second action feature similar to the first action feature of each audio clip from the action features of multiple dance actions pre-stored in the action library; input the second action feature corresponding to each audio clip into a pre-trained decoding model to obtain the third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature, wherein the decoding model is trained by the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained encoding model, and the third action feature is configured to indicate the action features of all audio clips. In this way, dance movements can be generated automatically, meeting users' needs for automated and intelligent dance movement generation and improving the quality of the dance movements.
  • FIG. 1 is a schematic flow chart of a dance movement generation method provided in an embodiment of the present application.
  • FIG. 2 is a schematic diagram of multi-frame dance actions.
  • FIG. 3 is a schematic diagram of human body key points.
  • FIG. 4 is a schematic flow chart of another dance movement generation method provided in an embodiment of the present application.
  • FIG. 5 is a schematic flow chart of yet another dance movement generation method provided in an embodiment of the present application.
  • FIG. 6 is a schematic flow chart of yet another dance movement generation method provided in an embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of a dance movement generation device provided in an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
  • Artificial Intelligence is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results.
  • artificial intelligence is a comprehensive technology in computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can respond in a similar way to human intelligence.
  • Artificial intelligence is to study the design principles and implementation methods of various intelligent machines so that machines have the functions of perception, reasoning and decision-making.
  • Artificial intelligence technology is a comprehensive discipline that covers a wide range of fields, including both hardware-level and software-level technologies.
  • the basic technologies of artificial intelligence generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies.
  • Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • the key technologies of speech technology include automatic speech recognition (ASR), text-to-speech (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future development direction of human-computer interaction, among which speech has become one of the most promising human-computer interaction methods.
  • Natural language processing is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that can achieve effective communication between people and computers using natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Therefore, research in this field will involve natural language, that is, the language people use in daily life, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, robot question answering, knowledge graph and other technologies.
  • Machine Learning is a multi-disciplinary subject that involves probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance.
  • Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent. Its applications are spread across all areas of artificial intelligence. Machine learning/deep learning usually includes artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning by teaching.
  • based on technologies such as machine learning mentioned above, this application proposes a dance movement generation scheme, which encodes the audio to be choreographed through a pre-trained encoding model to obtain the first action feature of the audio to be choreographed, determines the second action feature similar to the first action feature from the action library obtained through model learning, and further decodes the second action feature through a decoding model to generate the dance movements of the audio to be choreographed. In this way, dance movements can be generated automatically, and the quality of the generated dance movements is improved.
  • the dance movement generation method provided in the embodiments of the present application can be performed by a dance movement generation device, which can be provided in a computer device.
  • the computer device can include but is not limited to smart phones, tablet computers, laptop computers, desktop computers, in-vehicle smart terminals, smart watches and other smart terminal devices.
  • the dance movement generation method provided in the embodiments of the present application can be applied to choreography scenarios: for example, generating a dance movement matching the audio to be choreographed according to that audio.
  • the above application scenarios are only examples, and in other embodiments, the dance movement generation in the embodiments of the present application can be applied to any scenario associated with dance movement generation.
  • FIG. 1 is a schematic flow chart of a dance movement generation method provided in an embodiment of the present application.
  • the dance movement generation method in the embodiment of the present application can be performed by a dance movement generation device, wherein the dance movement generation device is provided in a terminal or a computer device, the terminal or computer device being as explained above.
  • the method in the embodiment of the present application includes the following steps.
  • S101: Acquire audio to be choreographed, and extract multiple audio clips from the audio to be choreographed.
  • when extracting multiple audio segments from the audio to be choreographed, the computer device can obtain the beat information of the audio to be choreographed, and extract multiple audio segments from the audio to be choreographed according to the beat information.
  • the computer device may extract multiple audio clips from the audio to be choreographed according to the specified beat.
  • the specified beat may be a 1/2 beat
  • the computer device may extract multiple audio clips corresponding to a 1/2 beat of each beat in the audio to be choreographed.
  • the specified beat may be other beats, which is not specifically limited in this application.
  • the present application helps to improve the quality of dance movements generated for the audio to be choreographed by extracting multiple audio segments from the audio to be choreographed to generate matching dance movements for each audio segment.
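  • as an illustration of the beat-based extraction described above, the following minimal Python sketch splits a track at 1/2-beat boundaries; it assumes the librosa library for beat tracking, and the function and variable names are illustrative rather than taken from the patent.

    import librosa

    def extract_clips(path, sub_beat=2):
        """Split audio into clips at 1/sub_beat-beat boundaries (sub_beat=2 -> 1/2 beat)."""
        y, sr = librosa.load(path, sr=None)
        # Beat tracking yields the beat positions of the audio (the beat information).
        tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
        beat_samples = librosa.frames_to_samples(beat_frames)
        # Insert midpoints between consecutive beats to obtain 1/2-beat boundaries.
        bounds = []
        for a, b in zip(beat_samples[:-1], beat_samples[1:]):
            step = (b - a) // sub_beat
            bounds.extend(a + i * step for i in range(sub_beat))
        bounds.append(beat_samples[-1])
        # Slice the waveform between successive boundaries to obtain the clips.
        return [y[s:e] for s, e in zip(bounds[:-1], bounds[1:]) if e > s]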
  • S102: Input the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips.
  • a computer device may input the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips, wherein the encoding model is obtained by training an initial encoding model with sample audio and sample dance movements corresponding to the sample audio.
  • the data form of the first action feature may include, but is not limited to, a matrix, a polygon mesh (Polygon Mesh Data, MMD), a three-dimensional general model format (FilmBox, FBX) and other data forms.
  • when a computer device inputs multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips, it can input each of the multiple audio clips into the pre-trained encoding model to obtain an action feature vector corresponding to each audio clip, and determine the first action feature based on the action feature vector corresponding to each audio clip.
  • the action feature vector corresponding to each action in each audio segment can be used as a row vector, so as to form the first action feature according to the action feature vector corresponding to each action in each audio segment of multiple audio segments, wherein an audio segment can include one or more actions, and each action corresponds to an action feature vector.
  • the first action feature can be a t*512 matrix, where t is the number of audio segments.
  • the number of columns of the matrix can be determined according to the number of actions in the action library. For example, if the number of actions in the action library is 64, the first action feature matrix can be determined to be a t*64 matrix.
  • S103: According to the first action feature of each audio clip, a second action feature similar to the first action feature is determined from action features of multiple dance actions pre-stored in an action library.
  • the action library pre-stores the action features of dance actions of multiple dance categories, and the data form of the action features includes but is not limited to a matrix.
  • the action library can be a T*24*3 matrix, where T is configured to indicate T frames of dance actions, 24 is configured to indicate the 24 key points of the human body for a dance action, and 3 is configured to indicate that each dance action has coordinate positions in three dimensions.
  • FIG. 2 is a schematic diagram of multi-frame dance actions: each human body action corresponds to one frame of dance action, and FIG. 2 includes multiple frames of dance actions. FIG. 3 is a schematic diagram of human body key points: the numbers 0-23 marked in FIG. 3 indicate multiple key points of the human body, where a dance action is determined according to the position of each key point, and key points at different positions can form multiple dance actions.
  • when the computer device determines, according to the first action feature, a second action feature similar to the first action feature from the action features of multiple dance actions pre-stored in the action library, it can obtain multiple first action feature vectors included in the first action feature of each audio clip, wherein each first action feature vector is configured to indicate one action; determine, according to each first action feature vector of each audio clip, a second action feature vector corresponding to each first action feature vector from the action features of multiple dance actions pre-stored in the action library; and determine the second action feature of each audio clip according to each second action feature vector of each audio clip.
  • when the computer device determines, based on each first action feature vector of each audio clip, the second action feature vector corresponding to each first action feature vector from the action features of multiple dance actions pre-stored in the action library, it can obtain the distance between each first action feature vector of each audio clip and each action feature vector in the action library, where each action feature vector in the action library is configured to indicate a pre-stored dance action; and obtain, from the action library, the action feature vector with the shortest distance to each first action feature vector of each audio clip as the second action feature vector of each audio clip.
  • when obtaining the distance between each first action feature vector of each audio clip and each action feature vector in the action library, the computer device can use the Euclidean distance to calculate the distance between each first action feature vector of each audio clip and each action feature vector in the action library.
  • the first action feature is specifically a first action feature matrix.
  • the computer device can obtain, for each audio clip, each row vector in the first action feature matrix corresponding to the audio clip, to obtain multiple first action feature vectors corresponding to the audio clip, wherein each row vector is configured to indicate an action; for each first action feature vector of each audio clip, determine the second action feature vector corresponding to the first action feature vector from the action features of multiple dance actions pre-stored in the action library, to obtain multiple second action feature vectors corresponding to each audio clip; for each audio clip, combine the multiple second action feature vectors corresponding to the audio clip to obtain a second action feature matrix corresponding to the audio clip, wherein the second action feature matrix is configured to represent the second action feature.
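  • the nearest-neighbour lookup described above can be sketched as follows; this is a hedged illustration, assuming NumPy, in which each row of a clip's first action feature matrix is replaced by the library row with the shortest Euclidean distance (all names are illustrative, not from the patent).

    import numpy as np

    def quantize_clip(first_feat, action_library):
        """first_feat: (n_actions, d) first action feature matrix of one clip;
        action_library: (n_entries, d) matrix, one row per pre-stored dance action.
        Returns the (n_actions, d) second action feature matrix."""
        # Pairwise Euclidean distances between clip vectors and library vectors.
        dists = np.linalg.norm(first_feat[:, None, :] - action_library[None, :, :], axis=-1)
        nearest = dists.argmin(axis=1)   # index of the shortest-distance library vector
        return action_library[nearest]   # combined rows form the second action feature matrix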
  • the present application determines the second action feature corresponding to the first action feature from the action library, which helps to more effectively decode the second action feature of each audio clip according to the decoding model, obtain the third action feature, and further determine the dance action of the audio to be choreographed based on the third action feature.
  • S104: Input the second action feature corresponding to each audio clip into the pre-trained decoding model to obtain a third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature.
  • the computer device may input the second motion feature corresponding to each audio clip into a pre-trained decoding model to obtain a third motion feature, and determine the dance motion of the audio to be choreographed based on the third motion feature, wherein the decoding model is obtained by training the initial decoding model with the sample motion feature corresponding to the sample audio and the sample dance motion corresponding to the sample audio, and the sample motion feature corresponding to the sample audio is obtained by encoding the sample audio using a pre-trained encoding model, and the third motion feature is configured to indicate the motion feature of all audio clips, wherein the third motion feature includes the number of key points of the human body and the position of each key point in all audio clips.
  • the third motion feature includes but is not limited to a third motion feature matrix, for example, the third motion feature may be a matrix of T*24*3, wherein T is configured to indicate the number of audio clips, 24 represents the key points of the human body, and 3 is configured to indicate the three-dimensional coordinates of each key point.
  • when the computer device determines the dance movements of the audio to be choreographed according to the third action feature, it can determine the number of key points of the human body and the position of each key point in all audio clips according to the third action feature; and determine the dance movements of the audio to be choreographed according to the number of key points of the human body and the position of each key point in all audio clips.
  • the computer device can determine the number of key points of the human body and the position of each key point in all audio clips according to the T*24*3 matrix, and determine the human body movements in all audio clips according to the number of key points of the human body and the position of each key point in all audio clips, and further determine the human body movements in all audio clips as the dance movements of the audio to be choreographed.
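  • a minimal sketch of this last step, assuming the third action feature arrives as a T*24*3 NumPy array (T clips, 24 human-body key points, 3D coordinates); the helper name is illustrative.

    import numpy as np

    def keypoints_from_third_feature(third_feat):
        """Recover per-frame key-point poses from a T*24*3 third action feature."""
        third_feat = np.asarray(third_feat).reshape(-1, 24, 3)
        n_frames, n_keypoints, _ = third_feat.shape
        # Each frame of the dance is the set of 24 (x, y, z) key-point positions.
        frames = [third_feat[t] for t in range(n_frames)]
        return n_keypoints, frames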
  • the embodiment of the present application encodes the audio segment of the audio to be choreographed through a pre-trained encoding model to obtain a corresponding first action feature, and determines a second action feature corresponding to the first action feature from the action features of a variety of dance actions pre-stored in an action library, which helps to use the second action feature to generate more accurate and high-quality dance actions for the audio to be choreographed through decoding by a pre-trained decoding model.
  • FIG. 4 is a schematic flow chart of another dance movement generation method provided in an embodiment of the present application.
  • the dance movement generation method in the embodiment of the present application can be performed by a dance movement generation device, wherein the dance movement generation device is provided in a terminal or a computer device, the terminal or computer device being as explained above.
  • the embodiment of the present application mainly explains the training process of the encoding model, which specifically includes the following steps.
  • S401: Acquire the audio to be choreographed, and extract multiple audio clips from the audio to be choreographed.
  • S402: Acquire a sample data set, where the sample data set includes a plurality of sample dance music data, each of which includes sample audio and sample dance movements.
  • S403: Train an initial encoding model according to the sample audio and sample dance movements of each sample dance music data, to obtain a pre-trained encoding model. In one embodiment, sample action features can be extracted from the sample dance movements of each sample dance music data, and the sample action features and the sample audio are input into the initial encoding model for training to obtain the pre-trained encoding model.
  • the sample action features include but are not limited to the data form of a matrix.
  • the computer device may obtain the number of key points and the key point positions of the human body corresponding to each sample dance movement of each sample dance music data, wherein the key point positions include the coordinates of each key point; and input the number of key points and the key point positions of the human body corresponding to each sample dance movement into the initial encoding model to extract the sample action features.
  • the key points of the human body of each sample dance action may include 24 key points of the human body, and in some embodiments, the key point positions of each sample dance action may include three-dimensional coordinate data of each sample dance action.
  • when inputting the sample action features and the sample audio into the initial encoding model for training to obtain the pre-trained encoding model, the computer device can input the sample action features and the sample audio into the initial encoding model for training to obtain a first sample action feature, and determine, according to the first sample action feature, a second sample action feature similar to the first sample action feature from the action features of multiple dance actions pre-stored in the action library; the second sample action feature is input into an initial decoding model to obtain a third sample action feature; the model parameters of the initial decoding model are adjusted according to the third sample action feature, and the second sample action feature is input into the adjusted decoding model for training to obtain a pre-trained decoding model.
  • the third sample action feature is configured to indicate the number of key points of the human body in the sample audio and the position of each key point.
  • when the computer device adjusts the model parameters of the initial decoding model according to the third sample action feature and inputs the second sample action feature into the adjusted decoding model for training to obtain a pre-trained decoding model, it can determine the dance movements of the sample audio according to the third sample action feature; compare the determined dance movements of the sample audio with the sample dance movements of the sample audio, and adjust the model parameters of the initial decoding model according to the comparison result; and input the second sample action feature into the decoding model with adjusted model parameters for retraining, to obtain the pre-trained decoding model.
  • when the computer device determines the dance movements of the sample audio based on the third sample action feature, it can determine the number of key points of the human body and the position of each key point in the sample audio based on the third sample action feature; and determine the dance movements of the sample audio based on the number of key points of the human body and the position of each key point in the sample audio.
  • when comparing the determined dance movements of the sample audio with the sample dance movements of the sample audio, the computer device may compare the action feature matrix of the dance movements of the sample audio with the action features of the sample dance movements.
  • when adjusting the model parameters of the encoding model according to the comparison result, the computer device can calculate the vector distance between each vector in the action features of the dance movements of the sample audio and each vector in the action features of the sample dance movements; when such a vector distance is greater than a first distance threshold, the model parameters of the initial encoding model are adjusted according to the vector distance between the vector in the action features of the dance movements of the sample audio and the vector of the action feature matrix of the sample dance movements.
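  • the thresholded comparison above could drive training with a hinge-style objective such as the following speculative PyTorch sketch; the patent does not specify the exact loss, so the form below (penalizing only vector distances above the first distance threshold) is an assumption.

    import torch

    def thresholded_distance_loss(pred_feats, sample_feats, threshold=0.1):
        """pred_feats, sample_feats: (n_actions, d) action feature matrices."""
        dists = torch.linalg.vector_norm(pred_feats - sample_feats, dim=-1)
        # Only per-vector distances above the threshold contribute to the update.
        return torch.clamp(dists - threshold, min=0).mean()

    # Calling loss.backward() on this value would then adjust the model parameters.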
  • the present application can obtain the mapping relationship between each sample dance movement and the sample action features through the initial encoding model, which helps to train the initial encoding model according to the sample action features and the sample audio, obtain the pre-trained encoding model, and obtain the mapping relationship among the sample action features, the sample audio, and the action feature matrix of the dance movements of the sample audio.
  • S404: Input the multiple audio clips into the pre-trained encoding model to obtain a first action feature of each of the multiple audio clips.
  • S405: According to the first action feature of each audio clip, determine a second action feature similar to the first action feature of each audio clip from action features of multiple dance actions pre-stored in an action library.
  • S406: Input the second action feature of each audio clip into the pre-trained decoding model to obtain a third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature.
  • the embodiment of the present application obtains the mapping relationship between each sample dance movement and the sample action features through the initial encoding model, which helps to train the initial encoding model according to the sample action features and the sample audio, obtain the pre-trained encoding model, and obtain the mapping relationship among the sample action features, the sample audio, and the first sample action feature, so as to generate the first action feature of the audio to be encoded through the pre-trained encoding model at test time.
  • FIG. 5 is a schematic flow chart of another dance movement generation method provided in an embodiment of the present application.
  • the dance movement generation method in the embodiment of the present application can be performed by a dance movement generation device, wherein the dance movement generation device is provided in a terminal or a computer device, the terminal or computer device being as explained above.
  • the embodiment of the present application mainly describes the training process of the decoding model, which specifically includes the following steps.
  • S501: Acquire the audio to be choreographed, and extract multiple audio clips from the audio to be choreographed.
  • S502: Input the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips.
  • S503: According to the first action feature of each audio clip, determine a second action feature similar to the first action feature of each audio clip from action features of multiple dance actions pre-stored in an action library.
  • S504: Input the sample audio into the pre-trained encoding model to obtain a first sample action feature corresponding to the sample audio.
  • S505: Train an initial decoding model according to the first sample action feature to obtain a pre-trained decoding model.
  • when a computer device trains an initial decoding model according to the first sample action feature to obtain a pre-trained decoding model, it can determine, according to the first sample action feature, a second sample action feature similar to the first sample action feature from the action features of multiple dance actions pre-stored in the action library; and input the second sample action feature into the initial decoding model for training to obtain the pre-trained decoding model.
  • when the computer device inputs the second sample action feature into the initial decoding model for training to obtain the pre-trained decoding model, the second sample action feature can be input into the initial decoding model to obtain a third sample action feature; the model parameters of the initial decoding model are adjusted according to the third sample action feature, and the second sample action feature is input into the adjusted decoding model for training to obtain the pre-trained decoding model.
  • a computer device can determine the dance movements of the sample audio based on the third sample movement feature; compare the determined dance movements of the sample audio with the sample dance movements of the sample audio, and adjust the model parameters of the initial decoding model based on the comparison result; input the second sample movement feature into the decoding model after the model parameters are adjusted to retrain the decoding model to obtain a pre-trained decoding model.
  • when the computer device determines the dance movements of the sample audio according to the third sample action feature, it can determine the number of key points of the human body and the position of each key point in the sample audio according to the third sample action feature; the dance movements of the sample audio are then determined according to the number of key points of the human body and the position of each key point in the sample audio.
  • the computer device may adjust the model parameters of the initial decoding model according to the third sample action feature and the sample action features.
  • when the computer device adjusts the model parameters of the initial decoding model according to the third sample action features and the sample action features, it can adjust the model parameters of the initial decoding model according to the vector distance between the vectors in the third sample action feature matrix and the vectors in the sample action feature matrix.
  • when the vector distance between a vector in the third sample action feature and a vector in the sample action features is greater than a second distance threshold, the computer device may adjust the model parameters of the initial decoding model according to that vector distance.
  • S506: Input the second action feature corresponding to each audio clip into the pre-trained decoding model to obtain a third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature.
  • the embodiment of the present application generates a first sample motion feature of sample audio through a pre-trained encoding model, and trains an initial decoding model based on the first sample motion feature to obtain a pre-trained decoding model, which helps to more accurately generate a third motion feature of the audio to be choreographed during testing, thereby generating more accurate and higher-quality dance movements.
  • FIG. 6 is a schematic flow chart of another dance movement generation method provided in an embodiment of the present application: the audio to be choreographed 61 is acquired, and multiple audio clips 62 are extracted from it; the multiple audio clips are input into the pre-trained encoding model to obtain a first action feature 63 of each of the multiple audio clips; a second action feature similar to the first action feature of each audio clip is determined from the action features of multiple dance actions pre-stored in the action library; and the second action feature of each audio clip is input into the pre-trained decoding model to obtain a third action feature 64 of all audio clips, so that the dance movements of the audio to be choreographed are determined according to the third action feature.
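  • pulling the pieces together, an illustrative end-to-end sketch of the FIG. 6 pipeline might look as follows, reusing the hypothetical helpers sketched earlier; `encoder`, `decoder` and `action_library` stand in for the pre-trained models and the pre-stored library, none of which are specified at code level in the patent.

    def generate_dance(audio_path, encoder, decoder, action_library):
        clips = extract_clips(audio_path)              # audio clips 62
        second_feats = []
        for clip in clips:
            first_feat = encoder(clip)                 # first action feature 63 of the clip
            second_feats.append(quantize_clip(first_feat, action_library))
        third_feat = decoder(second_feats)             # third action feature 64 of all clips
        _, frames = keypoints_from_third_feature(third_feat)
        return frames                                  # per-frame key-point poses = dance movements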
  • FIG. 7 is a schematic structural diagram of a dance movement generation device provided in an embodiment of the present application.
  • the dance movement generation device is set in a computer device, and the device includes: an acquisition unit 701, an encoding unit 702, a determination unit 703, and a decoding unit 704;
  • the acquisition unit 701 is configured to acquire the audio to be choreographed and extract multiple audio clips from the audio to be choreographed;
  • the encoding unit 702 is configured to input the multiple audio clips into a pre-trained encoding model to obtain a first motion feature of each audio clip in the multiple audio clips, wherein the encoding model is trained by sample audio and sample dance movements corresponding to the sample audio;
  • a determining unit 703 is configured to determine, according to the first action feature of each audio segment, a second action feature similar to the first action feature of each audio segment from action features of a plurality of dance actions pre-stored in an action library;
  • the decoding unit 704 is configured to input the second action feature corresponding to each audio clip into a pre-trained decoding model to obtain a third action feature, and determine the dance action of the audio to be choreographed based on the third action feature, wherein the decoding model is trained by the sample action features corresponding to the sample audio and the sample dance action corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained encoding model, and the third action feature is configured to indicate the action features of all audio clips.
  • when the acquisition unit 701 extracts multiple audio clips from the audio to be choreographed, the specific configuration is as follows: obtaining beat information of the audio to be choreographed; and extracting multiple audio segments from the audio to be choreographed according to the beat information.
  • the first action feature is specifically a first action feature matrix; when the determination unit 703 determines, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from the action features of a plurality of dance actions pre-stored in the action library, the specific configuration is as follows:
  • for each of the audio clips, acquiring each row vector in the first action feature matrix corresponding to the audio clip, to obtain multiple first action feature vectors corresponding to the audio clip, wherein each row vector is configured to indicate one action;
  • for each first action feature vector of each audio clip, determining the second action feature vector corresponding to the first action feature vector from the action features of multiple dance actions pre-stored in the action library, to obtain multiple second action feature vectors corresponding to each audio clip;
  • for each audio clip, combining the multiple second action feature vectors corresponding to the audio clip to obtain a second action feature matrix corresponding to the audio clip, wherein the second action feature matrix is configured to represent the second action feature.
  • when the determination unit 703 determines the second action feature vector corresponding to the first action feature vector from the action features of the multiple dance actions pre-stored in the action library, the specific configuration is as follows:
  • obtaining the distance between the first action feature vector and each action feature vector in the action library, wherein each action feature vector in the action library is configured to indicate a pre-stored dance action;
  • An action feature vector having the shortest distance to the first action feature vector is obtained from the action library as a second action feature vector corresponding to the first action feature vector.
  • before the encoding unit 702 inputs the multiple audio segments into the pre-trained encoding model to obtain the first action feature matrix of each of the multiple audio segments, it is further configured as follows:
  • acquiring a sample data set, wherein the sample data set includes multiple sample dance music data, and each sample dance music data includes sample audio and sample dance movements; extracting sample action features from the sample dance movements of each sample dance music data;
  • the sample action features and the sample audio are input into an initial encoding model for training to obtain the pre-trained encoding model.
  • when the encoding unit 702 extracts sample action features from the sample dance movements of each sample dance music data, it is configured as follows:
  • obtaining the number of key points and the key point positions of the human body corresponding to each sample dance movement, wherein the key point positions include the coordinates of each key point; the number and positions of the key points of the human body corresponding to each sample dance movement are input into the initial encoding model to extract the sample action features.
  • before the decoding unit 704 inputs the second action feature corresponding to each audio segment into the pre-trained decoding model to obtain the third action feature, it is further configured as follows:
  • inputting the sample audio into the pre-trained encoding model to obtain a first sample action feature corresponding to the sample audio; determining, according to the first sample action feature, a second sample action feature similar to the first sample action feature from the action features of the multiple dance actions pre-stored in the action library; inputting the second sample action feature into an initial decoding model to obtain a third sample action feature, and determining the dance movements of the sample audio according to the third sample action feature; the determined dance movements of the sample audio are compared with the sample dance movements of the sample audio, and the model parameters of the initial decoding model are adjusted according to the comparison result to obtain the pre-trained decoding model.
  • when the decoding unit 704 determines the dance movements of the to-be-choreographed audio according to the third action feature, the specific configuration is as follows:
  • determining, according to the third action feature, the number of key points of the human body and the position of each key point in the audio to be choreographed; the dance movements of the audio to be choreographed are determined according to the number of key points of the human body and the position of each key point in the audio to be choreographed.
  • the embodiment of the present application encodes the audio clips of the audio to be choreographed through a pre-trained encoding model to obtain a first motion feature of each audio clip, and determines a second motion feature similar to the first motion feature from the motion features of multiple dance movements pre-stored in an action library, which helps to use the second motion feature to generate more accurate and high-quality dance movements of the audio to be choreographed through decoding by a pre-trained decoding model.
  • FIG. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
  • the computer device includes: a memory 801 and a processor 802.
  • the computer device further includes a data interface 803, and the data interface 803 is configured to transfer data information between the computer device and other devices.
  • the memory 801 may include a volatile memory; the memory 801 may also include a non-volatile memory; the memory 801 may also include a combination of the above-mentioned types of memories.
  • the processor 802 may be a central processing unit (CPU).
  • the processor 802 may further include a hardware chip.
  • the above-mentioned hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD) or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA) or any combination thereof.
  • the memory 801 is configured to store a program, and the processor 802 can call the program stored in the memory 801 and is configured to perform the following steps: acquiring audio to be choreographed, and extracting multiple audio clips from the audio to be choreographed; inputting the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips, wherein the encoding model is trained by sample audio and sample dance movements corresponding to the sample audio; and determining, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from action features of multiple dance actions pre-stored in an action library;
  • the second action feature corresponding to each audio clip is input into a pre-trained decoding model to obtain a third action feature, and the dance movements of the audio to be choreographed are determined according to the third action feature, wherein the decoding model is trained by the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained encoding model, and the third action feature is configured to indicate the action features of all audio clips.
  • when the processor 802 extracts multiple audio segments from the audio to be choreographed, the specific configuration is as follows:
  • obtaining beat information of the audio to be choreographed; and extracting multiple audio segments from the audio to be choreographed according to the beat information.
  • the first action feature is specifically a first action feature matrix; when the processor 802 determines, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from the action features of a plurality of dance actions pre-stored in the action library, the specific configuration is as follows:
  • for each of the audio clips, acquiring each row vector in the first action feature matrix corresponding to the audio clip, to obtain multiple first action feature vectors corresponding to the audio clip, wherein each row vector is configured to indicate one action;
  • for each first action feature vector of each audio clip, determining the second action feature vector corresponding to the first action feature vector from the action features of multiple dance actions pre-stored in the action library, to obtain multiple second action feature vectors corresponding to each audio clip; for each audio clip, the multiple second action feature vectors corresponding to the audio clip are combined to obtain a second action feature matrix corresponding to the audio clip, wherein the second action feature matrix is configured to represent the second action feature.
  • the processor 802 determines the second action feature vector corresponding to the first action feature vector from the action features of the plurality of dance actions pre-stored in the action library, the specific configuration is as follows:
  • obtaining the distance between the first action feature vector and each action feature vector in the action library, wherein each action feature vector in the action library is configured to indicate a pre-stored dance action;
  • An action feature vector having the shortest distance to the first action feature vector is obtained from the action library as a second action feature vector corresponding to the first action feature vector.
  • before the processor 802 inputs the multiple audio segments into the pre-trained encoding model to obtain the first action feature matrix of each of the multiple audio segments, it is further configured as follows:
  • acquiring a sample data set, wherein the sample data set includes multiple sample dance music data, and each sample dance music data includes sample audio and sample dance movements; extracting sample action features from the sample dance movements of each sample dance music data;
  • the sample action features and the sample audio are input into an initial encoding model for training to obtain the pre-trained encoding model.
  • when the processor 802 extracts sample action features from the sample dance movements of each sample dance music data, it is configured as follows:
  • obtaining the number of key points and the key point positions of the human body corresponding to each sample dance movement, wherein the key point positions include the coordinates of each key point; the number and positions of the key points of the human body corresponding to each sample dance movement are input into the initial encoding model to extract the sample action features.
  • before the processor 802 inputs the second action feature corresponding to each audio segment into the pre-trained decoding model to obtain the third action feature, it is further configured as follows:
  • inputting the sample audio into the pre-trained encoding model to obtain a first sample action feature corresponding to the sample audio; determining, according to the first sample action feature, a second sample action feature similar to the first sample action feature from the action features of the multiple dance actions pre-stored in the action library; inputting the second sample action feature into an initial decoding model to obtain a third sample action feature, and determining the dance movements of the sample audio according to the third sample action feature; the determined dance movements of the sample audio are compared with the sample dance movements of the sample audio, and the model parameters of the initial decoding model are adjusted according to the comparison result to obtain the pre-trained decoding model.
  • when the processor 802 determines the dance movements of the to-be-choreographed audio according to the third action feature, the specific configuration is as follows:
  • determining, according to the third action feature, the number of key points of the human body and the position of each key point in the audio to be choreographed; the dance movements of the audio to be choreographed are determined according to the number of key points of the human body and the position of each key point in the audio to be choreographed.
  • the embodiment of the present application encodes the audio clips of the audio to be choreographed through a pre-trained encoding model to obtain a first motion feature of each audio clip, and determines a second motion feature similar to the first motion feature of each audio clip from the motion features of multiple dance movements pre-stored in an action library, which helps to use the second motion feature of each audio clip to generate more accurate and high-quality dance movements for the audio to be choreographed through decoding by a pre-trained decoding model.
  • An embodiment of the present application further provides a computer-readable storage medium, which stores a computer program.
  • when the computer program is executed by a processor, it implements the method described in the embodiments corresponding to FIG. 1, FIG. 4 or FIG. 5 of the present application, and can also implement the device of the corresponding embodiment of the present application described in FIG. 7, which will not be repeated here.
  • the computer-readable storage medium may be an internal storage unit of the device described in any of the foregoing embodiments, such as a hard disk or memory of the device.
  • the computer-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash card (Flash Card), etc. equipped on the device.
  • the computer-readable storage medium may also include both an internal storage unit of the device and an external storage device.
  • the computer-readable storage medium is configured to store the computer program and other programs and data required by the terminal.
  • the computer-readable storage medium may also be configured to temporarily store data that has been output or is to be output.
  • the storage medium can be a disk, an optical disk, a read-only memory (ROM) or a random access memory (RAM), etc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Databases & Information Systems (AREA)
  • Processing Or Creating Images (AREA)

Abstract

Embodiments of the present application disclose a dance movement generation method, a computer device and a storage medium. The method includes: acquiring audio to be choreographed, and extracting multiple audio clips from the audio to be choreographed; inputting the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips; determining, according to the first action feature, a second action feature similar to the first action feature from action features of multiple dance actions pre-stored in an action library; and inputting the second action feature corresponding to each audio clip into a pre-trained decoding model to obtain a third action feature, and determining the dance movements of the audio to be choreographed according to the third action feature. In this way, dance movements can be generated automatically, meeting users' needs for automated and intelligent dance movement generation and improving the quality of the dance movements.

Description

Dance movement generation method, computer device and storage medium
This application claims priority to Chinese patent application No. 202211441749.0, entitled "Dance movement generation method, computer device and storage medium", filed with the Chinese Patent Office on November 17, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a dance movement generation method, a computer device and a storage medium.
Background
Dance is an advanced art form composed of audio and dance movements that can convey emotion. How to match the audio with the dance movements is the focus and difficulty of choreography. Professional dancers can choreograph dance movements based on their own understanding of the emotion of the audio; however, this manual way of choreographing depends on professional dancers, and ordinary dancers or users cannot complete the choreography by themselves. Therefore, how to effectively realize automated choreography is very important.
Summary
The embodiments of the present application provide a dance movement generation method, a computer device and a storage medium, which can automatically generate dance movements and improve the quality of the dance movements.
In a first aspect, an embodiment of the present application provides a dance movement generation method, including:
acquiring audio to be choreographed, and extracting multiple audio clips from the audio to be choreographed; inputting the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips, wherein the encoding model is trained from sample audio and sample dance movements corresponding to the sample audio;
determining, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from action features of multiple dance actions pre-stored in an action library;
inputting the second action feature corresponding to each audio clip into a pre-trained decoding model to obtain a third action feature, and determining the dance movements of the audio to be choreographed according to the third action feature, wherein the decoding model is trained from the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained encoding model, and the third action feature is configured to indicate the action features of all audio clips.
In a second aspect, an embodiment of the present application provides a dance movement generation device, including:
an acquisition unit, configured to acquire audio to be choreographed, and extract multiple audio clips from the audio to be choreographed;
an encoding unit, configured to input the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips, wherein the encoding model is trained from sample audio and sample dance movements corresponding to the sample audio;
a determination unit, configured to determine, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from action features of multiple dance actions pre-stored in an action library;
a decoding unit, configured to input the second action feature corresponding to each audio clip into a pre-trained decoding model to obtain a third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature, wherein the decoding model is trained from the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained encoding model, and the third action feature is configured to indicate the action features of all audio clips.
In a third aspect, an embodiment of the present application provides a computer device, including a processor and a memory, the processor being configured to execute:
acquiring audio to be choreographed, and extracting multiple audio clips from the audio to be choreographed;
inputting the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips, wherein the encoding model is trained from sample audio and sample dance movements corresponding to the sample audio;
determining, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from action features of multiple dance actions pre-stored in an action library;
inputting the second action feature corresponding to each audio clip into a pre-trained decoding model to obtain a third action feature, and determining the dance movements of the audio to be choreographed according to the third action feature, wherein the decoding model is trained from the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained encoding model, and the third action feature is configured to indicate the action features of all audio clips.
In a fourth aspect, an embodiment of the present application further provides a computer-readable storage medium storing program instructions which, when executed, are configured to implement the method described in the first aspect above.
In the embodiments of the present application, audio to be choreographed can be acquired, and multiple audio clips extracted from the audio to be choreographed; the multiple audio clips are input into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips, wherein the encoding model is trained from sample audio and sample dance movements corresponding to the sample audio; according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip is determined from action features of multiple dance actions pre-stored in an action library; the second action feature corresponding to each audio clip is input into a pre-trained decoding model to obtain a third action feature, and the dance movements of the audio to be choreographed are determined according to the third action feature, wherein the decoding model is trained from the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained encoding model, and the third action feature is configured to indicate the action features of all audio clips. In this way, dance movements can be generated automatically, meeting users' needs for automated and intelligent dance movement generation and improving the quality of the dance movements.
Brief Description of the Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
FIG. 1 is a schematic flow chart of a dance movement generation method provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of multi-frame dance actions;
FIG. 3 is a schematic diagram of human body key points;
FIG. 4 is a schematic flow chart of another dance movement generation method provided in an embodiment of the present application;
FIG. 5 is a schematic flow chart of yet another dance movement generation method provided in an embodiment of the present application;
FIG. 6 is a schematic flow chart of yet another dance movement generation method provided in an embodiment of the present application;
FIG. 7 is a schematic structural diagram of a dance movement generation device provided in an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
Artificial Intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. The basic technologies of artificial intelligence generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics and other technologies. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
The key technologies of Speech Technology include automatic speech recognition (ASR), text-to-speech (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the future development direction of human-computer interaction, among which speech has become one of the most promising human-computer interaction methods.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies various theories and methods that enable effective communication between humans and computers using natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field therefore involves natural language, that is, the language people use in daily life, so it is closely related to the study of linguistics. Natural language processing technology usually includes text processing, semantic understanding, machine translation, question answering robots, knowledge graphs and other technologies.
Machine Learning (ML) is a multi-disciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills, and to reorganize existing knowledge structures so as to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications spread across all areas of artificial intelligence. Machine learning/deep learning usually includes artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, learning from demonstration and other technologies.
Based on technologies such as machine learning mentioned in the above artificial intelligence technologies, the present application proposes a dance movement generation scheme: the audio to be choreographed is encoded by a pre-trained encoding model to obtain a first action feature of the audio to be choreographed, a second action feature similar to the first action feature is determined from an action library obtained through model learning, and the second action feature is further decoded by a decoding model to generate the dance movements of the audio to be choreographed. In this way, dance movements can be generated automatically, and the quality of the generated dance movements is improved.
The dance movement generation method provided in the embodiments of the present application can be applied to a dance movement generation device, which can be provided in a computer device. In some embodiments, the computer device may include, but is not limited to, smart terminal devices such as smartphones, tablet computers, laptop computers, desktop computers, in-vehicle smart terminals and smart watches.
In some embodiments, the dance movement generation method provided in the embodiments of the present application can be applied to choreography scenarios, for example, generating dance movements matching the audio to be choreographed according to that audio. Of course, the above application scenarios are only examples; in other embodiments, the dance movement generation of the embodiments of the present application can be applied to any scenario associated with dance movement generation.
The dance movement generation method provided in the embodiments of the present application is schematically described below with reference to the drawings.
Referring to FIG. 1, FIG. 1 is a schematic flow chart of a dance movement generation method provided in an embodiment of the present application. The dance movement generation method of this embodiment can be performed by a dance movement generation device, which is provided in a terminal or a computer device, where the terminal or computer device is as explained above. Specifically, the method of this embodiment includes the following steps.
S101: Acquire audio to be choreographed, and extract multiple audio clips from the audio to be choreographed.
In this embodiment, when extracting multiple audio clips from the audio to be choreographed, the computer device can obtain beat information of the audio to be choreographed, and extract multiple audio clips from the audio to be choreographed according to the beat information.
In one embodiment, when extracting multiple audio clips from the audio to be choreographed according to the beat information, the computer device can extract multiple audio clips from the audio to be choreographed according to a specified beat. For example, the specified beat may be a 1/2 beat, and the computer device may extract the multiple audio clips corresponding to the 1/2 beat of each beat in the audio to be choreographed. In other embodiments, the specified beat may be another beat, which is not specifically limited in this application.
By extracting multiple audio clips from the audio to be choreographed so as to generate a matching dance movement for each audio clip, the present application helps to improve the quality of the dance movements generated for the audio to be choreographed.
S102: Input the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips.
In this embodiment, the computer device can input the multiple audio clips into a pre-trained encoding model to obtain a first action feature of each of the multiple audio clips, wherein the encoding model is obtained by training an initial encoding model with sample audio and sample dance movements corresponding to the sample audio. In some embodiments, the data form of the first action feature may include, but is not limited to, a matrix, a polygon mesh (Polygon Mesh Data, MMD), a three-dimensional general model format (FilmBox, FBX) and other data forms.
In one embodiment, when inputting the multiple audio clips into the pre-trained encoding model to obtain the first action feature of each of the multiple audio clips, the computer device can input each of the multiple audio clips into the pre-trained encoding model to obtain an action feature vector corresponding to each audio clip, and determine the first action feature according to the action feature vector corresponding to each audio clip.
Further, when the data form of the first action feature is a matrix and the computer device determines the first action feature according to the action feature vector corresponding to each audio clip, the action feature vector corresponding to each action in each audio clip can be used as a row vector, so that the first action feature is composed of the action feature vectors corresponding to each action in each of the multiple audio clips, wherein an audio clip can include one or more actions, and each action corresponds to one action feature vector. For example, the first action feature can be a t*512 matrix, where t is the number of audio clips.
When the data form of the first action feature is a matrix, the number of columns of the matrix can be determined according to the number of actions in the action library; for example, if the number of actions in the action library is 64, the first action feature matrix can be determined to be a t*64 matrix.
S103: According to the first action feature of each audio clip, determine a second action feature similar to the first action feature from action features of multiple dance actions pre-stored in an action library.
In this embodiment, the action library pre-stores action features of dance actions of multiple dance categories, and the data form of the action features includes but is not limited to a matrix. For example, the action library can be a T*24*3 matrix, where T is configured to indicate T frames of dance actions, 24 is configured to indicate the 24 key points of the human body for a dance action, and 3 is configured to indicate that each dance action has coordinate positions in three dimensions. As shown in FIG. 2, which is a schematic diagram of multi-frame dance actions, each human body action corresponds to one frame of dance action, and FIG. 2 includes multiple frames of dance actions. As shown in FIG. 3, which is a schematic diagram of human body key points, the numbers 0-23 marked in FIG. 3 are configured to indicate multiple key points of the human body, where a dance action is determined according to the position of each key point, and key points at different positions can form multiple dance actions.
In one embodiment, when determining, according to the first action feature, a second action feature similar to the first action feature from the action features of multiple dance actions pre-stored in the action library, the computer device can obtain multiple first action feature vectors included in the first action feature of each audio clip, wherein each first action feature vector is configured to indicate one action; determine, according to each first action feature vector of each audio clip, a second action feature vector corresponding to each first action feature vector from the action features of multiple dance actions pre-stored in the action library; and determine the second action feature of each audio clip according to each second action feature vector of each audio clip.
In one embodiment, when determining, according to each first action feature vector of each audio clip, the second action feature vector corresponding to each first action feature vector from the action features of multiple dance actions pre-stored in the action library, the computer device can obtain the distance between each first action feature vector of each audio clip and each action feature vector in the action library, where each action feature vector in the action library is configured to indicate one pre-stored dance action; and obtain, from the action library, the action feature vector with the shortest distance to each first action feature vector of each audio clip as the second action feature vector of each audio clip.
In one embodiment, when obtaining the distance between each first action feature vector of each audio clip and each action feature vector in the action library, the computer device can use the Euclidean distance to calculate the distance between each first action feature vector of each audio clip and each action feature vector in the action library.
In one embodiment, the first action feature is specifically a first action feature matrix. For each audio clip, the computer device can obtain each row vector in the first action feature matrix corresponding to the audio clip, to obtain multiple first action feature vectors corresponding to the audio clip, wherein each row vector is configured to indicate one action; for each first action feature vector of each audio clip, determine the second action feature vector corresponding to the first action feature vector from the action features of multiple dance actions pre-stored in the action library, to obtain multiple second action feature vectors corresponding to each audio clip; and, for each audio clip, combine the multiple second action feature vectors corresponding to the audio clip to obtain a second action feature matrix corresponding to the audio clip, wherein the second action feature matrix is configured to represent the second action feature.
By determining, from the action library, the second action feature corresponding to the first action feature, the present application helps to subsequently decode the second action feature of each audio clip more effectively according to the decoding model to obtain the third action feature, and further determine the dance movements of the audio to be choreographed according to the third action feature.
S104: Input the second action feature corresponding to each audio clip into the pre-trained decoding model to obtain a third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature.
In this embodiment, the computer device can input the second action feature corresponding to each audio clip into a pre-trained decoding model to obtain a third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature, wherein the decoding model is obtained by training an initial decoding model with the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio using the pre-trained encoding model, and the third action feature is configured to indicate the action features of all audio clips, the third action feature including the number of key points of the human body and the position of each key point in all audio clips. In some embodiments, the third action feature includes but is not limited to a third action feature matrix; for example, the third action feature can be a T*24*3 matrix, where T is configured to indicate the number of audio clips, 24 indicates the key points of the human body, and 3 is configured to indicate the three-dimensional coordinates of each key point.
When determining the dance movements of the audio to be choreographed according to the third action feature, the computer device can determine the number of key points of the human body and the position of each key point in all audio clips according to the third action feature, and determine the dance movements of the audio to be choreographed according to the number of key points of the human body and the position of each key point in all audio clips. For example, assuming the third action feature is a T*24*3 matrix, the computer device can determine the number of key points of the human body and the position of each key point in all audio clips according to the T*24*3 matrix, determine the human body actions in all audio clips according to the number of key points of the human body and the position of each key point in all audio clips, and further determine the human body actions in all audio clips as the dance movements of the audio to be choreographed.
By encoding the audio clips of the audio to be choreographed through a pre-trained encoding model to obtain the corresponding first action features, and determining the second action features corresponding to the first action features from the action features of multiple dance actions pre-stored in the action library, the embodiment of the present application helps to use the second action features to generate more accurate, higher-quality dance movements for the audio to be choreographed through decoding by the pre-trained decoding model.
Referring to FIG. 4, FIG. 4 is a schematic flow chart of another dance movement generation method provided in an embodiment of the present application. The dance movement generation method of this embodiment can be performed by a dance movement generation device, which is provided in a terminal or a computer device, where the terminal or computer device is as explained above. Specifically, this embodiment mainly explains the training process of the encoding model and includes the following steps.
S401: Acquire audio to be choreographed, and extract multiple audio clips from the audio to be choreographed.
S402: Acquire a sample data set, where the sample data set includes multiple sample dance music data, each of which includes sample audio and sample dance movements.
S403: Train an initial encoding model according to the sample audio and sample dance movements of each sample dance music data, to obtain a pre-trained encoding model.
In one embodiment, when training the initial encoding model according to the sample audio and sample dance movements of each sample dance music data to obtain the pre-trained encoding model, the computer device can extract sample action features from the sample dance movements of each sample dance music data, and input the sample action features and the sample audio into the initial encoding model for training to obtain the pre-trained encoding model. In some embodiments, the sample action features include but are not limited to the data form of a matrix.
In one embodiment, when extracting the sample action features from the sample dance movements of each sample dance music data, the computer device can obtain the number of key points and the key point positions of the human body corresponding to each sample dance movement of each sample dance music data, where the key point positions include the coordinates of each key point, and input the number of key points and the key point positions of the human body corresponding to each sample dance movement into the initial encoding model to extract the sample action features. In some embodiments, the human body key points of each sample dance movement may include the 24 key points of the human body, and the key point positions of each sample dance movement may include the three-dimensional coordinate data of each sample dance movement.
在一个实施例中,计算机设备在将样本动作特征和样本音频输入初始的编码模型进行训练,得到预训练的编码模型时,可以将样本动作特征和样本音频输入初始的编码模型进行训练,得到第一样本动作特征,根据第一样本动作特征从动作库预先存储的多种舞蹈动作的动作特征确定与第一样本动作特征相似的第二样本动作特征;将第二样本动作特征输入初始的解码模型,得到第三样本动作特征;根据第三样本动作特征调整初始的解码模型的模型参数,并将第二样本动作特征输入调整后的解码模型进行训练,得到预训练的解码模型。在某些实施例中,第三样本动作特征配置为指示样本音频中人体的关键点数量以及每个关键点的位置。
In an embodiment, when adjusting the model parameters of the initial decoding model according to the third sample action features and inputting the second sample action features into the adjusted decoding model for training to obtain the pre-trained decoding model, the computer device may determine the dance movements of the sample audio according to the third sample action features; compare the determined dance movements of the sample audio with the sample dance movements of the sample audio, and adjust the model parameters of the initial decoding model according to the comparison result; and input the second sample action features into the decoding model with the adjusted model parameters for retraining, so as to obtain the pre-trained decoding model.
When determining the dance movements of the sample audio according to the third sample action features, the computer device may determine, according to the third sample action features, the number of key points of the human body and the position of each key point in the sample audio, and determine the dance movements of the sample audio accordingly.
Further, when comparing the determined dance movements of the sample audio with the sample dance movements of the sample audio, the computer device may compare the action feature matrix of the dance movements of the sample audio with the action features of the sample dance movements.
Further, when adjusting the model parameters of the initial encoding model according to the comparison result, the computer device may compute the vector distances between the vectors in the action features of the dance movements of the sample audio and the vectors in the action features of the sample dance movements; when the vector distance between a vector in the action features of the dance movements of the sample audio and a vector in the action features of the sample dance movements is greater than a first distance threshold, the model parameters of the initial encoding model are adjusted according to that vector distance.
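The source does not fix a loss formula for this adjustment; purely as an illustration, the sketch below expresses such a threshold-gated update as a loss in PyTorch (an assumed framework, not named by the source), where only distances above the threshold contribute. The decoding model described later uses the same pattern with a second distance threshold.

```python
import torch

def threshold_gated_loss(pred_feats, target_feats, dist_threshold):
    # One Euclidean distance per action feature vector.
    dists = torch.norm(pred_feats - target_feats, dim=-1)
    # Only distances greater than the threshold drive the parameter update.
    mask = (dists > dist_threshold).float()
    return (dists * mask).sum() / mask.sum().clamp(min=1.0)
```

Calling backward() on this loss and stepping an optimizer over the model's parameters would then realize the adjustment described above.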
In the present application, the mapping relationship between each sample dance movement and the sample action features can be obtained through the initial encoding model, which helps to train the initial encoding model with the sample action features and the sample audio to obtain the pre-trained encoding model, and to obtain the mapping relationships among the sample action features, the sample audio, and the action feature matrix of the dance movements of the sample audio.
S404: Input the plurality of audio clips into the pre-trained encoding model to obtain the first action feature of each of the plurality of audio clips.
S405: According to the first action feature of each audio clip, determine, from the action features of the plurality of dance movements pre-stored in the action library, a second action feature similar to the first action feature of each audio clip.
S406: Input the second action feature of each audio clip into the pre-trained decoding model to obtain the third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature.
In the embodiment of the present application, the mapping relationship between each sample dance movement and the sample action features is obtained through the initial encoding model, which helps to train the initial encoding model with the sample action features and the sample audio to obtain the pre-trained encoding model, and to obtain the mapping relationships among the sample action features, the sample audio, and the first sample action features, so that at test time the pre-trained encoding model can generate the first action features of the audio to be choreographed.
Referring to FIG. 5, which is a schematic flowchart of yet another dance movement generation method provided by an embodiment of the present application, the dance movement generation method of this embodiment may be executed by a dance movement generation apparatus provided in a terminal or a computer device, the specific explanations of which are as described above. Specifically, this embodiment mainly describes the training process of the decoding model, and includes the following steps.
S501: Obtain the audio to be choreographed, and extract a plurality of audio clips from the audio to be choreographed.
S502: Input the plurality of audio clips into the pre-trained encoding model to obtain the first action feature of each of the plurality of audio clips.
S503: According to the first action feature of each audio clip, determine, from the action features of the plurality of dance movements pre-stored in the action library, a second action feature similar to the first action feature of each audio clip.
S504: Input the sample audio into the pre-trained encoding model to obtain first sample action features corresponding to the sample audio.
S505: Train an initial decoding model according to the first sample action features to obtain the pre-trained decoding model.
In an embodiment, when training the initial decoding model according to the first sample action features to obtain the pre-trained decoding model, the computer device may determine, according to the first sample action features, second sample action features similar to the first sample action features from the action features of the plurality of dance movements pre-stored in the action library, and input the second sample action features into the initial decoding model for training to obtain the pre-trained decoding model.
In an embodiment, when inputting the second sample action features into the initial decoding model for training to obtain the pre-trained decoding model, the computer device may input the second sample action features into the initial decoding model to obtain third sample action features; adjust the model parameters of the initial decoding model according to the third sample action features; and input the second sample action features into the adjusted decoding model for training to obtain the pre-trained decoding model.
In an implementation, the computer device may determine the dance movements of the sample audio according to the third sample action features; compare the determined dance movements of the sample audio with the sample dance movements of the sample audio, and adjust the model parameters of the initial decoding model according to the comparison result; and input the second sample action features into the decoding model with the adjusted model parameters for retraining to obtain the pre-trained decoding model.
When determining the dance movements of the sample audio according to the third sample action features, the computer device may determine, according to the third sample action features, the number of key points of the human body and the position of each key point in the sample audio, and determine the dance movements of the sample audio accordingly.
In an implementation, when adjusting the model parameters according to the third sample action features, the computer device may adjust the model parameters of the initial decoding model according to the third sample action features and the sample action features.
Further, when adjusting the model parameters of the initial decoding model according to the third sample action features and the sample action features, the computer device may adjust the model parameters of the initial decoding model according to the vector distances between the vectors in the third sample action feature matrix and the vectors in the sample action feature matrix.
Further, when the vector distance between a vector in the third sample action features and a vector in the sample action features is greater than a second distance threshold, the computer device may adjust the model parameters of the initial decoding model according to that vector distance, following the same threshold-gated pattern sketched above.
S506: Input the second action feature corresponding to each audio clip into the pre-trained decoding model to obtain the third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature.
In the embodiment of the present application, the first sample action features of the sample audio are generated by the pre-trained encoding model, and the initial decoding model is trained according to the first sample action features to obtain the pre-trained decoding model, which helps to generate the third action feature of the audio to be choreographed more accurately at test time, thereby generating more accurate, higher-quality dance movements.
Referring to FIG. 6, which is a schematic flowchart of yet another dance movement generation method provided by an embodiment of the present application: the audio to be choreographed 61 is obtained, and a plurality of audio clips 62 are extracted from it; the plurality of audio clips are input into the pre-trained encoding model to obtain the first action feature 63 of each of the plurality of audio clips; second action features similar to the first action feature of each audio clip are determined from the action features of the plurality of dance movements pre-stored in the action library; and the second action feature of each audio clip is input into the pre-trained decoding model to obtain the third action feature 64 of all the audio clips, from which the dance movements of the audio to be choreographed are determined.
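Putting the pieces together, a minimal end-to-end sketch of the pipeline in FIG. 6, reusing quantize_to_library from the sketch above; split_clips, encoder and decoder are hypothetical stand-ins for the beat-based segmentation and the pre-trained encoding and decoding models:

```python
import numpy as np

def generate_dance(audio, split_clips, encoder, decoder, library):
    clips = split_clips(audio)                     # plurality of audio clips
    first = np.stack([encoder(c) for c in clips])  # first action features, (t, dim)
    second = quantize_to_library(first, library)   # nearest pre-stored actions
    third = decoder(second)                        # third action feature, T*24*3
    return third                                   # key points -> dance movements
```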
Referring to FIG. 7, which is a schematic structural diagram of a dance movement generation apparatus provided by an embodiment of the present application: specifically, the dance movement generation apparatus is provided in a computer device, and the apparatus includes an acquisition unit 701, an encoding unit 702, a determining unit 703 and a decoding unit 704.
The acquisition unit 701 is configured to obtain the audio to be choreographed, and extract a plurality of audio clips from the audio to be choreographed;
the encoding unit 702 is configured to input the plurality of audio clips into a pre-trained encoding model to obtain a first action feature of each of the plurality of audio clips, wherein the encoding model is trained by sample audio and the sample dance movements corresponding to the sample audio;
the determining unit 703 is configured to determine, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from the action features of a plurality of dance movements pre-stored in an action library;
the decoding unit 704 is configured to input the second action feature corresponding to each audio clip into a pre-trained decoding model to obtain a third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature, wherein the decoding model is trained by the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio with the pre-trained encoding model, and the third action feature is configured to indicate the action features of all the audio clips.
Further, when extracting the plurality of audio clips from the audio to be choreographed, the acquisition unit 701 is specifically configured to:
obtain the beat information of the audio to be choreographed;
extract the plurality of audio clips from the audio to be choreographed according to the beat information, as illustrated in the sketch below.
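The source does not name a beat-tracking method; the sketch below uses librosa's beat tracker purely as an assumed stand-in for the beat-information step, cutting the waveform into one clip per inter-beat interval:

```python
import librosa

def split_by_beats(path):
    # Beat information via librosa's tracker (an assumed choice, not
    # specified by the source), then one clip per inter-beat interval.
    y, sr = librosa.load(path)
    _, beat_frames = librosa.beat.beat_track(y=y, sr=sr)
    beat_samples = librosa.frames_to_samples(beat_frames)
    bounds = [0, *beat_samples.tolist(), len(y)]
    return [y[a:b] for a, b in zip(bounds, bounds[1:]) if b > a]
```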
Further, the first action feature is specifically a first action feature matrix; when determining, according to the first action feature of each audio clip, the second action feature similar to the first action feature of each audio clip from the action features of the plurality of dance movements pre-stored in the action library, the determining unit 703 is specifically configured to:
for each audio clip, obtain the row vectors of the first action feature matrix corresponding to the audio clip, so as to obtain a plurality of first action feature vectors corresponding to the audio clip, wherein each row vector is configured to indicate one action;
for each first action feature vector of each audio clip, determine the second action feature vector corresponding to the first action feature vector from the action features of the plurality of dance movements pre-stored in the action library, so as to obtain a plurality of second action feature vectors corresponding to each audio clip;
for each audio clip, combine the plurality of second action feature vectors corresponding to the audio clip to obtain a second action feature matrix corresponding to the audio clip, wherein the second action feature matrix is configured to represent the second action feature.
Further, when determining the second action feature vector corresponding to the first action feature vector from the action features of the plurality of dance movements pre-stored in the action library, the determining unit 703 is specifically configured to:
obtain the distance between the first action feature vector and each action feature vector in the action library, wherein each action feature vector in the action library is configured to indicate one pre-stored dance movement;
obtain, from the action library, the action feature vector with the shortest distance to the first action feature vector as the second action feature vector corresponding to the first action feature vector.
Further, before the plurality of audio clips are input into the pre-trained encoding model to obtain the first action feature matrix of each of the plurality of audio clips, the encoding unit 702 is further configured to:
obtain a sample data set, wherein the sample data set includes a plurality of pieces of sample dance music data, and each piece of sample dance music data includes sample audio and sample dance movements;
extract sample action features from the sample dance movements of each piece of sample dance music data;
input the sample action features and the sample audio into an initial encoding model for training to obtain the pre-trained encoding model.
Further, when extracting the sample action features from the sample dance movements of each piece of sample dance music data, the encoding unit 702 is configured to:
obtain the number of key points and the key point positions of the human body corresponding to each sample dance movement;
input the number of key points and the key point positions of the human body corresponding to each sample dance movement into the initial encoding model to extract the sample action features.
Further, before the second action feature corresponding to each audio clip is input into the pre-trained decoding model to obtain the third action feature, the decoding unit 704 is further configured to:
input the sample audio into the pre-trained encoding model to obtain first sample action features of the sample audio, and determine, according to the first sample action features, second sample action features similar to the first sample action features from the action library;
input the second sample action features into an initial decoding model to obtain third sample action features, and determine the dance movements of the sample audio according to the third sample action features;
compare the determined dance movements of the sample audio with the sample dance movements of the sample audio, and adjust the model parameters of the initial decoding model according to the comparison result, so as to obtain the pre-trained decoding model.
Further, when determining the dance movements of the audio to be choreographed according to the third action feature, the decoding unit 704 is specifically configured to:
determine, according to the third action feature, the number of key points of the human body and the position of each key point in the audio to be choreographed;
determine the dance movements of the audio to be choreographed according to the number of key points of the human body and the position of each key point in the audio to be choreographed.
In the embodiment of the present application, the audio clips of the audio to be choreographed are encoded by the pre-trained encoding model to obtain the first action feature of each audio clip, and the second action features similar to the first action features are determined from the action features of the plurality of dance movements pre-stored in the action library, which helps the pre-trained decoding model to decode the second action features into more accurate, higher-quality dance movements for the audio to be choreographed.
Referring to FIG. 8, which is a schematic structural diagram of a computer device provided by an embodiment of the present application: specifically, the computer device includes a memory 801 and a processor 802.
In an embodiment, the computer device further includes a data interface 803, and the data interface 803 is configured to transfer data information between the computer device and other devices.
The memory 801 may include a volatile memory; the memory 801 may also include a non-volatile memory; the memory 801 may also include a combination of the above kinds of memory. The processor 802 may be a central processing unit (CPU). The processor 802 may further include a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), or any combination thereof.
The memory 801 is configured to store a program, and the processor 802 may call the program stored in the memory 801 and is configured to perform the following steps:
obtain the audio to be choreographed, and extract a plurality of audio clips from the audio to be choreographed;
input the plurality of audio clips into a pre-trained encoding model to obtain a first action feature of each of the plurality of audio clips, wherein the encoding model is trained by sample audio and the sample dance movements corresponding to the sample audio;
determine, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from the action features of a plurality of dance movements pre-stored in an action library;
input the second action feature corresponding to each audio clip into a pre-trained decoding model to obtain a third action feature, and determine the dance movements of the audio to be choreographed according to the third action feature, wherein the decoding model is trained by the sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio with the pre-trained encoding model, and the third action feature is configured to indicate the action features of all the audio clips.
Further, when extracting the plurality of audio clips from the audio to be choreographed, the processor 802 is specifically configured to:
obtain the beat information of the audio to be choreographed;
extract the plurality of audio clips from the audio to be choreographed according to the beat information.
Further, the first action feature is specifically a first action feature matrix; when determining, according to the first action feature of each audio clip, the second action feature similar to the first action feature of each audio clip from the action features of the plurality of dance movements pre-stored in the action library, the processor 802 is specifically configured to:
for each audio clip, obtain the row vectors of the first action feature matrix corresponding to the audio clip, so as to obtain a plurality of first action feature vectors corresponding to the audio clip, wherein each row vector is configured to indicate one action;
for each first action feature vector of each audio clip, determine the second action feature vector corresponding to the first action feature vector from the action features of the plurality of dance movements pre-stored in the action library, so as to obtain a plurality of second action feature vectors corresponding to each audio clip;
for each audio clip, combine the plurality of second action feature vectors corresponding to the audio clip to obtain a second action feature matrix corresponding to the audio clip, wherein the second action feature matrix is configured to represent the second action feature.
Further, when determining the second action feature vector corresponding to the first action feature vector from the action features of the plurality of dance movements pre-stored in the action library, the processor 802 is specifically configured to:
obtain the distance between the first action feature vector and each action feature vector in the action library, wherein each action feature vector in the action library is configured to indicate one pre-stored dance movement;
obtain, from the action library, the action feature vector with the shortest distance to the first action feature vector as the second action feature vector corresponding to the first action feature vector.
Further, before the plurality of audio clips are input into the pre-trained encoding model to obtain the first action feature matrix of each of the plurality of audio clips, the processor 802 is further configured to:
obtain a sample data set, wherein the sample data set includes a plurality of pieces of sample dance music data, and each piece of sample dance music data includes sample audio and sample dance movements;
extract sample action features from the sample dance movements of each piece of sample dance music data;
input the sample action features and the sample audio into an initial encoding model for training to obtain the pre-trained encoding model.
Further, when extracting the sample action features from the sample dance movements of each piece of sample dance music data, the processor 802 is configured to:
obtain the number of key points and the key point positions of the human body corresponding to each sample dance movement;
input the number of key points and the key point positions of the human body corresponding to each sample dance movement into the initial encoding model to extract the sample action features.
Further, before the second action feature corresponding to each audio clip is input into the pre-trained decoding model to obtain the third action feature, the processor 802 is further configured to:
input the sample audio into the pre-trained encoding model to obtain first sample action features of the sample audio, and determine, according to the first sample action features, second sample action features similar to the first sample action features from the action library;
input the second sample action features into an initial decoding model to obtain third sample action features, and determine the dance movements of the sample audio according to the third sample action features;
compare the determined dance movements of the sample audio with the sample dance movements of the sample audio, and adjust the model parameters of the initial decoding model according to the comparison result, so as to obtain the pre-trained decoding model.
Further, when determining the dance movements of the audio to be choreographed according to the third action feature, the processor 802 is specifically configured to:
determine, according to the third action feature, the number of key points of the human body and the position of each key point in the audio to be choreographed;
determine the dance movements of the audio to be choreographed according to the number of key points of the human body and the position of each key point in the audio to be choreographed.
In the embodiment of the present application, the audio clips of the audio to be choreographed are encoded by the pre-trained encoding model to obtain the first action feature of each audio clip, and the second action feature similar to the first action feature of each audio clip is determined from the action features of the plurality of dance movements pre-stored in the action library, which helps the pre-trained decoding model to decode the second action features of the audio clips into more accurate, higher-quality dance movements for the audio to be choreographed.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, it implements the method described in the embodiments corresponding to FIG. 1, FIG. 4 or FIG. 5 of the present application, and can also implement the apparatus of the embodiment of the present application corresponding to FIG. 7, which will not be repeated here.
The computer-readable storage medium may be an internal storage unit of the device described in any of the foregoing embodiments, such as a hard disk or a memory of the device. The computer-readable storage medium may also be an external storage device of the device, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card provided on the device. Further, the computer-readable storage medium may include both an internal storage unit of the device and an external storage device. The computer-readable storage medium is configured to store the computer program and other programs and data required by the terminal. The computer-readable storage medium may also be configured to temporarily store data that has been output or is to be output.
Those of ordinary skill in the art can understand that all or part of the processes in the methods of the above embodiments can be implemented by instructing the relevant hardware through a computer program; the program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the above methods. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
What is disclosed above is only a part of the embodiments of the present application, which of course cannot be used to limit the scope of rights of the present application. Those of ordinary skill in the art can understand all or part of the processes for implementing the above embodiments, and equivalent changes made in accordance with the claims of the present application still fall within the scope covered by the present application.

Claims (10)

  1. A dance movement generation method, comprising:
    obtaining audio to be choreographed, and extracting a plurality of audio clips from the audio to be choreographed;
    inputting the plurality of audio clips into a pre-trained encoding model to obtain a first action feature of each of the plurality of audio clips, wherein the encoding model is trained by sample audio and sample dance movements corresponding to the sample audio;
    determining, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from action features of a plurality of dance movements pre-stored in an action library;
    inputting the second action feature corresponding to each audio clip into a pre-trained decoding model to obtain a third action feature, and determining the dance movements of the audio to be choreographed according to the third action feature, wherein the decoding model is trained by sample action features corresponding to the sample audio and the sample dance movements corresponding to the sample audio, the sample action features corresponding to the sample audio are obtained by encoding the sample audio with the pre-trained encoding model, and the third action feature is configured to indicate the action features of all the audio clips.
  2. The method according to claim 1, wherein the extracting a plurality of audio clips from the audio to be choreographed comprises: obtaining beat information of the audio to be choreographed;
    extracting the plurality of audio clips from the audio to be choreographed according to the beat information.
  3. The method according to claim 1, wherein the first action feature is specifically a first action feature matrix, and the determining, according to the first action feature of each audio clip, a second action feature similar to the first action feature of each audio clip from action features of a plurality of dance movements pre-stored in an action library comprises:
    for each audio clip, obtaining the row vectors of the first action feature matrix corresponding to the audio clip, so as to obtain a plurality of first action feature vectors corresponding to the audio clip, wherein each row vector is configured to indicate one action; for each first action feature vector of each audio clip, determining a second action feature vector corresponding to the first action feature vector from the action features of the plurality of dance movements pre-stored in the action library, so as to obtain a plurality of second action feature vectors corresponding to each audio clip;
    for each audio clip, combining the plurality of second action feature vectors corresponding to the audio clip to obtain a second action feature matrix corresponding to the audio clip, wherein the second action feature matrix is configured to represent the second action feature.
  4. The method according to claim 3, wherein the determining a second action feature vector corresponding to the first action feature vector from the action features of the plurality of dance movements pre-stored in the action library comprises:
    obtaining the distance between the first action feature vector and each action feature vector in the action library, wherein each action feature vector in the action library is configured to indicate one pre-stored dance movement;
    obtaining, from the action library, the action feature vector with the shortest distance to the first action feature vector as the second action feature vector corresponding to the first action feature vector.
  5. The method according to claim 1, wherein before the inputting the plurality of audio clips into the pre-trained encoding model to obtain the first action feature matrix of each of the plurality of audio clips, the method further comprises:
    obtaining a sample data set, wherein the sample data set includes a plurality of pieces of sample dance music data, and each piece of sample dance music data includes sample audio and sample dance movements;
    extracting sample action features from the sample dance movements of each piece of sample dance music data;
    inputting the sample action features and the sample audio into an initial encoding model for training to obtain the pre-trained encoding model.
  6. The method according to claim 5, wherein the extracting sample action features from the sample dance movements of each piece of sample dance music data comprises:
    obtaining the number of key points and the key point positions of the human body corresponding to each sample dance movement;
    inputting the number of key points and the key point positions of the human body corresponding to each sample dance movement into the initial encoding model to extract the sample action features.
  7. The method according to claim 1, wherein before the inputting the second action feature corresponding to each audio clip into the pre-trained decoding model to obtain the third action feature, the method further comprises:
    inputting the sample audio into the pre-trained encoding model to obtain first sample action features of the sample audio, and determining, according to the first sample action features, second sample action features similar to the first sample action features from the action library;
    inputting the second sample action features into an initial decoding model to obtain third sample action features, and determining the dance movements of the sample audio according to the third sample action features;
    comparing the determined dance movements of the sample audio with the sample dance movements of the sample audio, and adjusting the model parameters of the initial decoding model according to the comparison result, so as to obtain the pre-trained decoding model.
  8. The method according to claim 1, wherein the determining the dance movements of the audio to be choreographed according to the third action feature comprises:
    determining, according to the third action feature, the number of key points of the human body and the position of each key point in the audio to be choreographed;
    determining the dance movements of the audio to be choreographed according to the number of key points of the human body and the position of each key point in the audio to be choreographed.
  9. A computer device, comprising a processor, an input device, an output device and a memory, wherein the processor, the input device, the output device and the memory are connected to one another, the memory is configured to store a computer program, the computer program comprises a program, and the processor is configured to call the program to perform the method according to any one of claims 1-8.
  10. A computer-readable storage medium, wherein program instructions are stored in the computer-readable storage medium, and the program instructions, when executed, are configured to implement the method according to any one of claims 1-8.
PCT/CN2023/090889 2022-11-17 2023-04-26 Dance movement generation method, computer device and storage medium WO2024103637A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211441749.0A CN115712739B (zh) 2022-11-17 2022-11-17 Dance movement generation method, computer device and storage medium
CN202211441749.0 2022-11-17

Publications (2)

Publication Number Publication Date
WO2024103637A1 true WO2024103637A1 (zh) 2024-05-23
WO2024103637A9 WO2024103637A9 (zh) 2024-07-11

Family

ID=85233642

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/090889 WO2024103637A1 (zh) 2022-11-17 2023-04-26 Dance movement generation method, computer device and storage medium

Country Status (2)

Country Link
CN (1) CN115712739B (zh)
WO (1) WO2024103637A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115712739B (zh) 2022-11-17 2024-03-26 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Dance movement generation method, computer device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101615302A (zh) * 2009-07-30 2009-12-30 Zhejiang University Machine learning-based dance movement generation method driven by music data
CN110955786A (zh) * 2019-11-29 2020-04-03 NetEase (Hangzhou) Network Co., Ltd. Method and apparatus for generating dance movement data
US20200342646A1 (en) * 2019-04-23 2020-10-29 Adobe Inc. Music driven human dancing video synthesis
CN111970536A (zh) * 2020-07-24 2020-11-20 Beihang University Method and apparatus for generating video based on audio
CN114756706A (zh) * 2022-04-06 2022-07-15 Beijing Dajia Internet Information Technology Co., Ltd. Resource synthesis method, apparatus, device and storage medium
CN115712739A (zh) * 2022-11-17 2023-02-24 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Dance movement generation method, computer device and storage medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6448483B1 (en) * 2001-02-28 2002-09-10 Wildtangent, Inc. Dance visualization of music
JP6313159B2 (ja) * 2014-08-15 2018-04-18 National Institute of Advanced Industrial Science and Technology Dance motion data creation system and dance motion data creation method
KR102013577B1 (ko) * 2015-09-14 2019-08-23 Electronics and Telecommunications Research Institute Apparatus and method for supporting choreography design
CN110992449B (zh) * 2019-11-29 2023-04-18 NetEase (Hangzhou) Network Co., Ltd. Dance movement synthesis method, apparatus, device and storage medium
KR20210120636A (ko) * 2020-03-27 2021-10-07 주식회사 안무공장 Apparatus for providing value-added services based on choreography content
KR102192210B1 (ko) * 2020-06-23 2020-12-16 Inha University Research and Business Foundation LSTM-based dance motion generation method and apparatus
CN111711868B (zh) * 2020-06-24 2021-07-20 Institute of Automation, Chinese Academy of Sciences Audio-visual multimodal dance generation method, system and apparatus
CN113160848B (zh) * 2021-05-07 2024-06-04 NetEase (Hangzhou) Network Co., Ltd. Dance animation generation method, model training method, apparatus, device and storage medium
CN113856203A (zh) * 2021-10-28 2021-12-31 Guangzhou Aimei Network Technology Co., Ltd. Dance editing method and apparatus, and dancing machine
CN114419205B (zh) * 2021-12-22 2024-01-02 Beijing Baidu Netcom Science and Technology Co., Ltd. Driving method for a virtual digital human and training method for a pose acquisition model
CN114401439B (zh) * 2022-02-10 2024-03-19 Tencent Music Entertainment Technology (Shenzhen) Co., Ltd. Dance video generation method, device and storage medium
CN114758636A (зh) * 2022-03-14 2022-07-15 深圳市航天泰瑞捷电子有限公司 Dance music generation method, apparatus, terminal and readable storage medium


Also Published As

Publication number Publication date
CN115712739B (zh) 2024-03-26
CN115712739A (zh) 2023-02-24

Similar Documents

Publication Publication Date Title
CN109859736B (zh) Speech synthesis method and system
WO2020135194A1 (zh) Emotion engine technology-based voice interaction method, smart terminal, and storage medium
WO2022116977A1 (zh) Action driving method and apparatus for a target object, device, storage medium, and computer program product
CN113569892A (zh) Image description information generation method and apparatus, computer device, and storage medium
CN111368118B (zh) Image description generation method, system, apparatus, and storage medium
CN109344242B (zh) Dialogue question-answering method, apparatus, device, and storage medium
CN111598979B (zh) Facial animation generation method, apparatus, device, and storage medium for a virtual character
CN112115687A (zh) Question generation method combining triples and entity types in a knowledge base
CN112233698A (зh) Character emotion recognition method, apparatus, terminal device, and storage medium
CN112837669B (zh) Speech synthesis method, apparatus, and server
CN112071330A (zh) Audio data processing method, device, and computer-readable storage medium
CN114861653B (zh) Language generation method, apparatus, device, and storage medium for virtual interaction
WO2024103637A1 (zh) Dance movement generation method, computer device and storage medium
WO2024103637A9 (zh) Dance movement generation method, computer device and storage medium
CN112149651B (zh) Deep learning-based facial expression recognition method, apparatus, and device
CN114511860A (zh) Difference description sentence generation method, apparatus, device, and medium
CN113886643A (zh) Digital human video generation method, apparatus, electronic device, and storage medium
WO2023226239A1 (zh) Object emotion analysis method and apparatus, and electronic device
CN115423908A (zh) Virtual face generation method, apparatus, device, and readable storage medium
CN109961152B (zh) Personalized interaction method, system, terminal device, and storage medium for a virtual idol
Georgiou et al. M3: MultiModal Masking Applied to Sentiment Analysis.
CN112735377B (zh) Speech synthesis method, apparatus, terminal device, and storage medium
CN113178200A (зh) Voice conversion method, apparatus, server, and storage medium
CN116895087A (zh) Facial feature screening method and apparatus, and facial feature screening system
CN116434758A (зh) Voiceprint recognition model training method, apparatus, electronic device, and storage medium