CN112700521A - Music-driven human skeleton dance motion generation system - Google Patents


Info

Publication number
CN112700521A
CN112700521A (application CN202110101178.5A / CN202110101178A)
Authority
CN
China
Prior art keywords
dance
music
module
action
motion
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110101178.5A
Other languages
Chinese (zh)
Inventor
刘科成 (Liu Kecheng)
肖双九 (Xiao Shuangjiu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2021-01-26
Filing date: 2021-01-26
Publication date: 2021-04-23
Application filed by Shanghai Jiaotong University
Priority to CN202110101178.5A
Publication of CN112700521A
Legal status: Pending (Current)

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/205 - 3D [Three Dimensional] animation driven by audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 - Animation
    • G06T 13/20 - 3D [Three Dimensional] animation
    • G06T 13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00, characterised by the type of extracted parameters

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Signal Processing (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A music-driven human skeleton dance action generation system, comprising: a music feature extraction system, a GAN-based dance action generation system and a dance action evaluation system, connected in sequence, wherein: the dance action evaluation system receives dance actions from the dance action generation system and evaluates their quality in terms of dance authenticity, diversity and complexity. By introducing prior knowledge of human dance, taking an ordinary music file as input, and relying on the ability of the GAN-based dance action generation model to create new data, the invention can generate coherent human skeleton dance actions that match the characteristics of the input music.

Description

Music-driven human skeleton dance motion generation system
Technical Field
The invention relates to technology in the field of computer automatic dance, and in particular to a music-driven human skeleton dance motion generation system.
Background
Computer automatic dance technology automatically generates dance motion sequences, driven by music, through a specific computational model. The key to automatic dance is determining the mapping between music and dance: different choreographers create different dance movements for the same background music, and spatially modeling human dance movements while avoiding unnatural motions is very difficult. Even when a generated dance motion sequence deviates only slightly from normal human posture, the result can appear unnatural. Research in this field also suffers from a lack of high-quality datasets: most available data is collected for motion recognition tasks and either does not include dance motions or lacks the music matched to them.
Earlier research abstracted automatic dance into a similarity-based retrieval problem: according to features of the input music such as rhythm and melody, the segments that best match the music are selected from a pre-constructed, fixed set of dance actions and combined into a new dance motion sequence. Dance movements generated this way are limited to the pre-constructed set, so genuinely new movements outside the set cannot be produced, and the joins between basic dance movement units suffer from motion incoherence.
Disclosure of Invention
Aiming at the above defects in the prior art, the invention provides a music-driven human skeleton dance motion generation system which, by introducing prior knowledge of human dance, taking an ordinary music file as input, and relying on the ability of a GAN-based dance motion generation model to create new data, can generate coherent, smooth human skeleton dance motions that match the characteristics of the input music.
The invention is realized by the following technical scheme:
the invention comprises the following steps: music characteristic extraction system, dance action generation system and dance action evaluation system based on GAN that link to each other in proper order, wherein: the dance action evaluation index receives dance actions of the dance action generation system and outputs evaluation results from three aspects of dance authenticity, diversity and complexity respectively.
The music feature extraction system comprises: a short-time Fourier transform module, a music chroma feature extraction module and an onset strength detection module, connected in sequence, wherein: the short-time Fourier transform module receives the input music and converts it from the time domain to the time-frequency domain, the chroma feature extraction module extracts chroma features reflecting the melodic characteristics of the music, and the onset strength detection module detects the onset strength reflecting the rhythmic characteristics of the music.
The GAN-based dance action generation system comprises: a dance motion generator module and a dance motion discriminator module, wherein: the dance motion generator module, an encoder-decoder network, generates dance motions, and the dance motion discriminator module judges whether dance motion data is fake data created by the generator module.
The music encoder in the dance motion generator module extracts a music feature vector for each time step from the music data, specifically: the music data passes through several one-dimensional convolution layers, then through a bidirectional GRU layer and a fully connected layer; the convolution layers reduce the data dimensionality, the GRU layer accounts for the time dimension, and the fully connected layer finally outputs a feature map of size T × 256. The music encoder further introduces random noise into this feature map: randomly generated noise passes through a GRU layer, and the GRU output is combined with the music encoding result and output.
The dance motion decoder in the dance motion generator module is a multilayer perceptron (MLP) composed of several fully connected layers; the MLP extends a single-layer neural network with one or more hidden layers arranged between the input layer and the output layer.
The dance motion discriminator module comprises: a global dance motion discriminator and a local dance motion discriminator, wherein: the global dance motion discriminator measures whether the dance motions in the music data and the dance motion data output by the generator module match the music characteristics as a whole, judging the realism of the dance motions; the local dance motion discriminator divides the dance motion data into several subsequences and judges the realism of the dance motions by checking whether they are locally continuous.
The dance action evaluation system comprises: a Fréchet Inception Distance (FID) module, a mean variance module, and a mean instantaneous velocity module, wherein: the FID module scores the human skeleton dance actions output by the dance action generation system on a realism index, the mean variance module scores them on a diversity index, and the mean instantaneous velocity module scores them on a complexity index.
Technical effects
The invention as a whole solves the problems of the prior art that the generated dance motions do not match the music well and that completely new motions cannot be generated.
Compared with the prior art, the invention can generate multiple types of dance motions, including female dance, ballet and mechanical dance, for different types of input music. If the dance types in the dataset are further expanded, a trained model can acquire the ability to generate more dance types.
Drawings
FIG. 1 is a schematic structural view of the present invention;
FIG. 2 is a schematic diagram of a music feature extraction algorithm implementation flow of the present invention;
FIG. 3 is a schematic diagram of a dance motion generator model according to the present invention;
FIG. 4 is a schematic diagram of an implementation flow of the global dance motion discriminator model according to the present invention;
FIG. 5 is a schematic diagram of an implementation process of the local dance motion discriminator model according to the present invention.
Detailed Description
As shown in fig. 1, the music-driven human skeleton dance motion generation system according to this embodiment comprises: a music feature extraction system, a GAN-based dance action generation system and a dance action evaluation system, connected in sequence, wherein: the dance action evaluation system receives dance actions from the dance action generation system and evaluates their quality in terms of dance authenticity, diversity and complexity.
The music feature extraction system comprises: a short-time Fourier transform module, a chroma feature extraction module and an onset strength detection module, connected in sequence, wherein: the short-time Fourier transform module receives the input music and converts it from the time domain to the time-frequency domain, after which the time-frequency information is processed along two branches: in one branch, the frequency information of each frame is mapped into a single octave based on a pitch salience algorithm, and the chroma feature extraction module extracts chroma features reflecting the melodic characteristics of the music; in the other branch, the onset strength detection module detects, based on the SuperFlux algorithm, the onset strength reflecting the rhythmic characteristics of the music.
The time-frequency information is a representation reflecting how the frequency content of the music changes over time.
The short-time Fourier transform module implements a variant of the basic Fourier transform, operating as follows: a long non-stationary signal is treated as a combination of a series of shorter stationary signals; the long signal is divided in time by windowing, and a discrete Fourier transform is then applied to each short segment.
The chroma features are music features defined according to twelve-tone equal temperament, forming a T × 12 matrix, where: T denotes the number of music frames and 12 denotes the 12 semitone pitch classes within one octave.
The onset strength is a rhythm feature that is continuous in time.
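As a concrete illustration of this feature extraction pipeline, the following is a minimal sketch using the librosa audio library (the patent does not name a library, so the library choice, file name and parameters are assumptions; librosa's default spectral-flux onset method stands in here for the SuperFlux variant):

```python
import numpy as np
import librosa

# Load an ordinary music file (path is a placeholder).
y, sr = librosa.load("input_music.wav", sr=22050)

# Short-time Fourier transform: window the long signal in time and
# apply a DFT to each short segment, giving a time-frequency map.
spec = np.abs(librosa.stft(y, n_fft=2048, hop_length=512))

# Chroma: fold each frame's frequency content into one octave,
# yielding 12 semitone pitch classes per frame (librosa returns
# 12 x T; transpose for the T x 12 layout described above).
chroma = librosa.feature.chroma_stft(S=spec**2, sr=sr, hop_length=512)

# Onset strength: a time-continuous rhythm feature.
onset_env = librosa.onset.onset_strength(y=y, sr=sr, hop_length=512)

print(chroma.T.shape, onset_env.shape)  # (T, 12) and (T,)
```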
The GAN-based dance action generation system comprises: a dance motion generator module and a dance motion discriminator module, connected to each other, wherein: the dance motion generator module generates dance motions, and the dance motion discriminator module judges whether dance motion data is fake data created by the generator module.
The dance action generator module adopts an encoder-decoder network structure and comprises: a music encoder and a dance motion decoder.
The music encoder extracts a music feature vector for each time step from the music data, specifically: the music data passes through several one-dimensional convolution layers, then through a bidirectional GRU layer and a fully connected layer; the convolution layers reduce the data dimensionality, the GRU layer accounts for the time dimension, and the fully connected layer finally outputs a feature map of size T × 256.
In a conventional GAN model the generator's input is only random noise, whereas here the input is music data. Although inputting music data alone largely achieves the intended research goal, the diversity of the generated dance movements remains limited. Dance is in part a stochastic art: even the same dancer, dancing to the same music, never dances exactly the same way twice. Random noise is therefore introduced into the music feature vector computed by the music encoder so that the model can generate more diverse dance movements. To add this noise perturbation to the sequence data, randomly generated noise is passed through a GRU layer, and the GRU output is then combined with the music encoding result.
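A minimal PyTorch sketch of such an encoder follows. The T × 256 output size comes from the text and the 10-dimensional noise from the experiments below; the layer counts, kernel sizes, hidden widths and the 13-dimensional per-frame input (12 chroma bins plus onset strength) are assumptions:

```python
import torch
import torch.nn as nn

class MusicEncoder(nn.Module):
    """Conv1d stack -> bidirectional GRU -> fully connected layer,
    producing a T x 256 music feature map, combined with a
    GRU-processed random noise stream."""

    def __init__(self, feat_dim=13, noise_dim=10, hidden=128):
        super().__init__()
        # 1D convolutions over time reduce the feature dimensionality.
        self.convs = nn.Sequential(
            nn.Conv1d(feat_dim, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(64, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        # The bidirectional GRU accounts for the time dimension.
        self.gru = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, 256)            # T x 256 feature map
        # Separate GRU for the randomly generated noise.
        self.noise_gru = nn.GRU(noise_dim, 256, batch_first=True)

    def forward(self, music, noise):
        # music: (B, T, feat_dim); noise: (B, T, noise_dim)
        x = self.convs(music.transpose(1, 2)).transpose(1, 2)
        x, _ = self.gru(x)
        x = self.fc(x)                                   # (B, T, 256)
        z, _ = self.noise_gru(noise)                     # (B, T, 256)
        return torch.cat([x, z], dim=-1)                 # fuse encoding and noise
```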
The dance motion decoder is a multilayer perceptron (MLP) composed of several fully connected layers; the MLP extends a single-layer neural network with one or more hidden layers arranged between the input layer and the output layer. The MLP used in the invention improves on this basic structure: the output of each hidden layer is added to that layer's input, and the sum is fed to the next hidden layer; batch normalization (BN) layers and ReLU activation functions are inserted between the hidden layers, which effectively avoids the degradation problem of deep networks.
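A sketch of this residual MLP in PyTorch (the hidden width, block count and 23-joint output layout are assumptions; the input size matches the encoder sketch above):

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Hidden layer whose output is added to its own input, followed
    by BN and ReLU, as described for the decoder MLP."""

    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.bn = nn.BatchNorm1d(dim)

    def forward(self, x):
        return torch.relu(self.bn(self.fc(x) + x))  # residual connection

class DanceDecoder(nn.Module):
    def __init__(self, in_dim=512, hidden=512, n_joints=23, n_blocks=3):
        super().__init__()
        self.inp = nn.Linear(in_dim, hidden)
        self.blocks = nn.Sequential(*[ResidualBlock(hidden)
                                      for _ in range(n_blocks)])
        self.out = nn.Linear(hidden, n_joints * 3)   # 3D joint coordinates

    def forward(self, x):
        # x: (B*T, in_dim), one fused music+noise vector per frame.
        return self.out(self.blocks(self.inp(x)))
```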
As shown in fig. 3, the dance motion discriminator module comprises: a global dance motion discriminator and a local dance motion discriminator, wherein: the global dance motion discriminator receives both music and dance motion data, and judges whether the input dance motions match the music characteristics as a whole, assessing the realism of the dance motions; the local dance motion discriminator takes only dance motion data as input, divides it into several subsequences, and judges the realism of the dance motions by checking whether they are locally continuous.
As shown in fig. 4, the global dance motion discriminator processes the input data with a music encoder and a dance motion encoder, combines the outputs of the two encoders, and feeds the result into a subsequent binary classification network; the structure of its music encoder is identical to that of the music encoder in the dance motion generator. The dance motion data and the motion frame-difference data each pass through a series of two-dimensional convolution layers and are then combined; the combined result is merged with the music encoding through two convolution layers and two fully connected layers. Finally, the data is fed into a binary classification network comprising a one-dimensional convolution layer and a fully connected layer to obtain a yes/no output, where "yes" means the discriminator judges the dance motion to be a real motion matching the music, and "no" means it judges the motion to be a fake, machine-generated motion that does not match the music.
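A condensed PyTorch sketch of this discriminator, under stated assumptions: the channel sizes are guesses, and the merge and classification stages are simplified into pooled convolutions plus fully connected layers rather than the exact layer sequence of fig. 4:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalDiscriminator(nn.Module):
    """Sketch: motion and frame-difference streams through 2D convs,
    merged, fused with the music encoding, then a real/fake head."""

    def __init__(self, music_dim=256, coord_dim=3):
        super().__init__()
        def stream():
            return nn.Sequential(
                nn.Conv2d(coord_dim, 16, 3, padding=1), nn.LeakyReLU(0.2),
                nn.Conv2d(16, 32, 3, padding=1), nn.LeakyReLU(0.2),
            )
        self.motion_stream = stream()
        self.diff_stream = stream()
        self.merge = nn.Sequential(
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.2),
            nn.Conv2d(64, 64, 3, padding=1), nn.LeakyReLU(0.2),
        )
        self.classify = nn.Sequential(
            nn.Linear(64 + music_dim, 64), nn.LeakyReLU(0.2),
            nn.Linear(64, 1), nn.Sigmoid(),   # "yes"/"no" output
        )

    def forward(self, motion, music_code):
        # motion: (B, coord_dim, T, J); music_code: (B, music_dim),
        # e.g. the time-pooled output of the generator's music encoder.
        diff = motion[:, :, 1:] - motion[:, :, :-1]   # motion frame differences
        diff = F.pad(diff, (0, 0, 1, 0))              # restore T frames
        x = torch.cat([self.motion_stream(motion),
                       self.diff_stream(diff)], dim=1)
        x = self.merge(x).mean(dim=(2, 3))            # pool over time and joints
        return self.classify(torch.cat([x, music_code], dim=1))
```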
As shown in FIG. 5, the local dance motion discriminator is essentially identical to the global dance motion discriminator, except that the music processing part is omitted and an unfolding (unfold) operation is added on the input dance motions. The unfold operation extracts sliding local blocks from a batch of input samples, with input parameters similar to those of a two-dimensional convolution. It divides the complete dance motion sequence into several partially overlapping dance motion subsequences; a network similar to the global discriminator then processes the subsequence data to obtain a judgment of whether the input dance motion is real.
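A small illustration of this unfold operation on a joint-coordinate sequence, using torch.Tensor.unfold (the sequence length, window size, step and skeleton layout are assumptions):

```python
import torch

# A dance sequence of T = 100 frames, each frame holding 69 values
# (an assumed layout of 23 joints x 3 coordinates).
seq = torch.randn(100, 69)

# Slide a 20-frame window with a step of 10 along the time axis,
# yielding partially overlapping subsequences for the local discriminator.
windows = seq.unfold(dimension=0, size=20, step=10)  # (9, 69, 20)
subseqs = windows.permute(0, 2, 1)                   # (9, 20, 69)
print(subseqs.shape)
```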
The dance action evaluation system comprises: an FID module, a mean variance module, and a mean instantaneous velocity module, wherein: the FID module measures the quality of the human skeleton dance actions on a realism index, the mean variance module measures it on a diversity index, and the mean instantaneous velocity module measures it on a complexity index.
The dance authenticity index is a key quantitative index measuring GAN model performance from the perspective of generated-sample quality, specifically: $\mathrm{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$, where: $\mu$ is an empirical mean, $\Sigma$ an empirical covariance, $\mathrm{Tr}$ the trace of a matrix, $r$ the real data set, and $g$ the generated data set. A smaller FID value indicates that the generated data is closer to the real data, since the means and covariances of the two are then very close.
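A direct numpy/scipy implementation of this formula, assuming feature vectors have already been extracted from the real and generated motions (the patent does not specify the feature extractor):

```python
import numpy as np
from scipy import linalg

def fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    """FID = ||mu_r - mu_g||^2 + Tr(S_r + S_g - 2 (S_r S_g)^{1/2}).

    real_feats, gen_feats: (n_samples, feat_dim) feature matrices.
    """
    mu_r, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_r = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)

    # Matrix square root of the covariance product; discard the tiny
    # imaginary components that arise from numerical error.
    covmean = linalg.sqrtm(sigma_r @ sigma_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real

    return float(np.sum((mu_r - mu_g) ** 2)
                 + np.trace(sigma_r + sigma_g - 2 * covmean))
```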
The diversity index is obtained by averaging the variance of each joint point's spatial position; it measures the dispersion of the data distribution, with a larger variance indicating greater dispersion. The premise is that the larger the mean variance, the more diverse the generated dance movements.
The complexity index is the average of the instantaneous velocity of each joint point at each moment, taken over both the time and space dimensions; faster dance movements tend to be more complex and to rate higher in viewing quality.
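Under these definitions, both indices reduce to simple statistics over a joint-position tensor; a sketch (the data layout and frame rate are assumptions):

```python
import numpy as np

def mean_variance(motion: np.ndarray) -> float:
    """Diversity: average, over joints and axes, of the variance of
    each joint's spatial position over time. motion: (T, J, 3)."""
    return float(motion.var(axis=0).mean())

def mean_instantaneous_velocity(motion: np.ndarray, fps: float = 25.0) -> float:
    """Complexity: average speed of every joint at every moment,
    approximating velocity by frame differences times the frame rate."""
    vel = np.diff(motion, axis=0) * fps      # (T-1, J, 3)
    speed = np.linalg.norm(vel, axis=-1)     # (T-1, J)
    return float(speed.mean())
```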
In concrete experiments, experimental data were obtained using the PyTorch machine learning library on an i7-4770K CPU with an NVIDIA GTX 970 discrete graphics card, as follows:
TABLE 1. FID comparison (data reproduced as an image in the original filing)
TABLE 2. Mean variance comparison (data reproduced as an image in the original filing)
TABLE 3. Mean instantaneous velocity comparison (data reproduced as an image in the original filing)
As the data in the tables show, the model performs best, and comes closest to real dance movement, when 10-dimensional noise is introduced. Because ballet contains many turning movements with large motion amplitude, making it the most complex of the three types, the FID of the generated ballet is much higher. Diversity is better for female dance and ballet, while the diversity of the generated mechanical dance is inferior to the reference method; in practice the mean variance of mechanical dance should not be too large, and the mechanical dance generated by the reference method looks more like female dance, which inflates its mean variance. For female dance and ballet, the results of this method are slightly faster than those of the reference method; for mechanical dance the result is slightly slower, because the model in the reference method cannot distinguish K-POP from electronic music, and given electronic music as input it outputs a dance that looks more like female dance.
Compared with the prior art, the invention improves the realism, diversity and complexity of automatically generated dance motions.
The foregoing embodiments may be modified in many different ways by those skilled in the art without departing from the spirit and scope of the invention, which is defined by the appended claims; all changes that come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.

Claims (7)

1. A music-driven human skeletal dance motion generation system, comprising: a music feature extraction system, a GAN-based dance action generation system and a dance action evaluation system, connected in sequence, wherein: the dance action evaluation system receives dance actions from the dance action generation system and outputs evaluation results in terms of dance authenticity, diversity and complexity;
the music feature extraction system comprises: a short-time Fourier transform module, a music chroma feature extraction module and an onset strength detection module, connected in sequence, wherein: the short-time Fourier transform module receives the input music and converts it from the time domain to the time-frequency domain, the chroma feature extraction module extracts chroma features reflecting the melodic characteristics of the music, and the onset strength detection module detects the onset strength reflecting the rhythmic characteristics of the music;
the chroma features are music features defined according to twelve-tone equal temperament, forming a T × 12 matrix, where: T denotes the number of music frames and 12 denotes the 12 semitone pitch classes within one octave;
the onsetsentngth is a rhythm characteristic continuous in time.
2. The music-driven human skeletal dance motion generation system according to claim 1, wherein said GAN-based dance action generation system comprises: a dance motion generator module and a dance motion discriminator module, wherein: the dance motion generator module, an encoder-decoder network, generates dance motions, and the dance motion discriminator module judges whether dance motion data is fake data created by the generator module.
3. The music-driven human skeletal dance motion generation system according to claim 1, wherein the music encoder in said dance motion generator module extracts a music feature vector for each time step from the music data, specifically: the music data passes through several one-dimensional convolution layers, then through a bidirectional GRU layer and a fully connected layer; the convolution layers reduce the data dimensionality, the GRU layer accounts for the time dimension, and the fully connected layer finally outputs a feature map of size T × 256; the music encoder further introduces random noise into this feature map: randomly generated noise passes through a GRU layer, and the GRU output is combined with the music encoding result and output.
4. The music-driven human skeletal dance motion generation system according to claim 1, wherein the dance motion decoder of said dance motion generator module is a multilayer perceptron composed of several fully connected layers; the multilayer perceptron extends a single-layer neural network with one or more hidden layers arranged between the input layer and the output layer.
5. The music-driven human skeletal dance motion generation system according to claim 1, wherein said dance motion discriminator module comprises: a global dance motion discriminator and a local dance motion discriminator, wherein: the global dance motion discriminator measures whether the dance motions in the music data and the dance motion data output by the dance motion generator module match the music characteristics as a whole, judging the realism of the dance motions; the local dance motion discriminator divides the dance motion data into several subsequences and judges the realism of the dance motions by checking whether they are locally continuous.
6. The music-driven human skeletal dance motion generation system according to claim 1, wherein said dance action evaluation system comprises: a Fréchet Inception Distance (FID) module, a mean variance module, and a mean instantaneous velocity module, wherein: the FID module scores the human skeleton dance actions in the dance action data output by the dance action generation system on a dance authenticity index, the mean variance module scores them on a diversity index, and the mean instantaneous velocity module scores them on a complexity index.
7. The music-driven human skeletal dance motion generation system according to claim 6, wherein said dance authenticity index is a key quantitative index measuring GAN model performance from the perspective of generated-sample quality, specifically: $\mathrm{FID} = \|\mu_r - \mu_g\|^2 + \mathrm{Tr}\big(\Sigma_r + \Sigma_g - 2(\Sigma_r \Sigma_g)^{1/2}\big)$, where: $\mu$ is an empirical mean, $\Sigma$ an empirical covariance, $\mathrm{Tr}$ the trace of a matrix, $r$ the real data set, and $g$ the generated data set;
the diversity index is obtained by averaging the variance of each joint point's spatial position;
the complexity index is the average of the instantaneous velocity of each joint point at each moment, over both the time and space dimensions.
CN202110101178.5A 2021-01-26 2021-01-26 Music-driven human skeleton dance motion generation system Pending CN112700521A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110101178.5A CN112700521A (en) 2021-01-26 2021-01-26 Music-driven human skeleton dance motion generation system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110101178.5A CN112700521A (en) 2021-01-26 2021-01-26 Music-driven human skeleton dance motion generation system

Publications (1)

Publication Number Publication Date
CN112700521A true CN112700521A (en) 2021-04-23

Family

ID=75516073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110101178.5A Pending CN112700521A (en) 2021-01-26 2021-01-26 Music-driven human skeleton dance motion generation system

Country Status (1)

Country Link
CN (1) CN112700521A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20200042143A (en) * 2018-10-15 2020-04-23 주식회사 더브이엑스 Dancing room service system and method thereof
CN110955786A (en) * 2019-11-29 2020-04-03 网易(杭州)网络有限公司 Dance action data generation method and device
CN110992449A (en) * 2019-11-29 2020-04-10 网易(杭州)网络有限公司 Dance action synthesis method, device, equipment and storage medium
CN111986295A (en) * 2020-08-14 2020-11-24 腾讯科技(深圳)有限公司 Dance synthesis method and device and electronic equipment
CN111968202A (en) * 2020-08-21 2020-11-20 北京中科深智科技有限公司 Real-time dance action generation method and system based on music rhythm

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
HSIN-YING LEE ET AL.: "Dancing to Music", 33rd Conference on Neural Information Processing Systems *
XUANCHI REN ET AL.: "Self-supervised Dance Video Synthesis Conditioned on Music", Proceedings of the 28th ACM International Conference on Multimedia *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113521711A (en) * 2021-07-13 2021-10-22 济南幼儿师范高等专科学校 Dance training auxiliary system and method
CN115035221A (en) * 2022-06-17 2022-09-09 广州虎牙科技有限公司 Dance animation synthesis method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
Yang et al. A regression approach to music emotion recognition
CN112700521A (en) Music-driven human skeleton dance motion generation system
Won et al. Toward interpretable music tagging with self-attention
Ioannidis et al. Gait recognition using compact feature extraction transforms and depth information
Desmet et al. Assessing a clarinet player's performer gestures in relation to locally intended musical targets
Shin et al. Skeleton-based dynamic hand gesture recognition using a part-based GRU-RNN for gesture-based interface
Kao et al. Temporally guided music-to-body-movement generation
Zhu et al. Quantized gan for complex music generation from dance videos
CN110175551A (en) A kind of sign Language Recognition Method
Oka et al. Marker-less piano fingering recognition using sequential depth images
Meng et al. Improving speech related facial action unit recognition by audiovisual information fusion
Essid et al. Fusion of multimodal information in music content analysis
Lim et al. Emotion Recognition by Facial Expression and Voice: Review and Analysis
Jiang et al. Forgery-free signature verification with stroke-aware cycle-consistent generative adversarial network
Sahoo et al. MIC_FuzzyNET: Fuzzy integral based ensemble for automatic classification of musical instruments from audio signals
Itohara et al. Particle-filter based audio-visual beat-tracking for music robot ensemble with human guitarist
Tavakoli et al. Study of Gabor and local binary patterns for retinal image analysis
Aleksandrova et al. Face recognition systems based on Neural Compute Stick 2, CPU, GPU comparison
Yu et al. A neural harmonic-aware network with gated attentive fusion for singing melody extraction
Hassan et al. An effective combination of textures and wavelet features for facial expression recognition
Shiu et al. Robust on-line beat tracking with kalman filtering and probabilistic data association (kf-pda)
Tralie Geometric multimedia time series
Kannapiran et al. Voice-based gender recognition model using FRT and light GBM
Yuan et al. Research on the Evaluation Model of Dance Movement Recognition and Automatic Generation Based on Long Short-Term Memory
Mancusi Harmonizing deep learning: a journey through the innovations in signal processing, source separation and music generation

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2021-04-23)