CN113423005A - Motion-driven intelligent music generation method and system - Google Patents


Info

Publication number
CN113423005A
Authority
CN
China
Prior art keywords
audio
video
network
neural network
music
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110541902.6A
Other languages
Chinese (zh)
Other versions
CN113423005B (en)
Inventor
吴庆波
施兆丰
李宏亮
孟凡满
许林峰
潘力立
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN202110541902.6A
Publication of CN113423005A
Application granted
Publication of CN113423005B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4394Processing of audio elementary streams involving operations for analysing the audio stream, e.g. detecting features or characteristics in audio streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • H04N21/4398Processing of audio elementary streams involving reformatting operations of audio signals
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44012Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving rendering scenes according to scene graphs, e.g. MPEG-4 scene graphs
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8106Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • H04N21/8113Monomedia components thereof involving special audio data, e.g. different tracks for different languages comprising music, e.g. song in MP3 format
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8166Monomedia components thereof involving executable data, e.g. software

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Electrically Operated Instructional Devices (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a motion-driven intelligent music generation method and system. The method comprises the following steps: constructing a data set, extracting initial features of the video frame images, temporally associating the video frame features, generating raw audio data, and training and testing the network. Exploiting the speed and low cost of computers, the invention designs and trains a deep neural network structure that intelligently processes sports videos and generates raw audio data from which a soundtrack is synthesized, solving the problems of high difficulty and high time and economic cost in current soundtrack production. The invention also builds a motion-driven intelligent music generation system that accurately extracts features from the input video data and generates music of relatively high quality that matches the sports scene; the subjective MOS of the generated music exceeds 3.5, soundtracks for sports and similar videos can be produced quickly and in batches, and the time and economic cost of music production are cut by more than half.

Description

Motion-driven intelligent music generation method and system
Technical Field
The invention relates to the technical fields of computer vision and intelligent music scoring, and in particular to a motion-driven intelligent music generation method and system.
Background
Sport is an activity through which human beings cultivate their physical qualities. Since the beginning of the 21st century, with continuous advances in science and technology, people have sought to integrate various elements into sport; for example, sports videos are set to music to give the audience a richer audiovisual experience. Music is an important form of expression in the arts: it reflects human ways of thinking and unites regularity with creativity. On the one hand, music is produced according to a body of music theory; on the other hand, music itself carries human emotion.
Techniques that use deep learning to analyze images and extract their deep features have become increasingly mature. Various convolutional neural networks have emerged in recent years, such as AlexNet, ResNet, and MobileNetV3. These networks perform well and can effectively extract deep image features. Since a video is a sequence of frame images, video features can be extracted by applying a convolutional neural network to each frame and then temporally associating the resulting features, mapping the raw frame data of the video into one-dimensional vectors for further analysis. The LRCN model proposed by Donahue et al. uses this idea to perform video action recognition.
In recent years, music generation with deep learning has also become a popular research topic, and several strong audio generation networks have appeared, for example: MuseGAN, proposed by Dong et al. for multi-track symbolic music generation; the symbolic music generation network MidiNet proposed by Li-Chia Yang et al.; the jazz music generation network JazzGAN; the Conditional LSTM-GAN model of Yu, Yi et al. for composing music from lyrics; and MelGAN, which generates audio waveforms from mel-spectrograms.
A common shortcoming of the above techniques is that none of them combines visual information with auditory information. Video feature extraction is typically used for action recognition and video captioning, while audio generation is mainly used for speech synthesis and single-modality music generation. This limitation makes it difficult for these techniques to add a soundtrack to a video, even though research in artificial intelligence is increasingly moving from single-modality to multi-modality information processing and generation.
In addition, many existing music generation approaches work by generating sequences of scales and instrument types, i.e., the model essentially selects a scale and an instrument timbre at each time step according to the input. Although such designs can generate music of good quality, they have drawbacks: building a data set requires a large amount of annotation (scales, temperaments, instrument types, and so on), which greatly increases manual workload, and this generation logic inevitably makes the generated music monotonous and lacking in variety. A system and method that overcomes the high difficulty and the high time and economic cost of current soundtrack production is therefore urgently needed.
Disclosure of Invention
To solve the problems of the prior art, the invention provides a motion-driven intelligent music generation method and system. Exploiting the speed and low cost of computers, it designs and trains a deep neural network structure and builds a motion-driven intelligent music generation system that intelligently processes sports videos and generates raw audio data from which a soundtrack is synthesized, thereby solving the problems of high difficulty and high time and economic cost of current soundtrack production described in the background.
To achieve this purpose, the invention provides the following technical solution: a motion-driven intelligent music generation method comprising the following steps:
S1, constructing a data set: downloading sports videos with soundtracks from video websites, cropping the downloaded videos into equal-length segments with a Python program, selecting suitable video segments, and separating the audio and video of the selected segments to obtain an audio-video data set;
S2, extracting initial features of the video frame images: fixing the size of each frame image in the video and extracting features from each frame with an improved convolutional neural network, which takes the fixed-size image as input and outputs a feature vector of dimension 1 × 576;
S3, temporally associating the video frame features: extracting the initial features of each frame according to step S2, feeding them to a recurrent neural network composed of GRU units that associates the frame features over time, and outputting video frame features carrying temporal information;
S4, generating raw audio data: constructing an audio generation network based on the WaveNet architecture, in which dilated causal convolutions map the input data to audio data and residual and skip connections let the gradient propagate through the deep model and accelerate training convergence; the output of step S3 serves as the input of the audio generation network, which outputs raw audio data;
S5, network training and testing: connecting the improved convolutional neural network, the recurrent neural network and the audio generation network, training them on the data set constructed in step S1 while continuously updating the network parameters, and ending training once the network loss converges to obtain a model;
audio and video data are then input into the model for testing, a sequence of audio sample points is obtained at the output, and the sequence is converted into an audible audio file by program code.
Preferably, the equal-length cropping in step S1 crops each video segment to a duration of 4 seconds.
Preferably, the selection of video segments in step S1 selects segments that are free of noise and whose soundtrack rhythm changes noticeably.
Preferably, fixing the size of each frame image in step S2 means fixing each frame to 224 × 224 × 3.
Preferably, the improvement of the convolutional neural network in step S2 changes its purpose from image classification to image feature extraction.
Preferably, the convolutional neural network in step S2 is MobileNetV3.
A motion-driven intelligent music generation system comprising the following modules:
a data set construction module: cropping downloaded sports videos with soundtracks into equal-length segments, selecting video segments and separating their audio and video to obtain an audio-video data set;
a video frame initial feature extraction module: fixing the size of each frame image in the video and extracting features from each frame with the improved convolutional neural network;
a video frame feature temporal association module: forming a recurrent neural network from GRU units, temporally associating the initial frame features and outputting video frame features carrying temporal information;
an audio data generation module: constructing an audio generation network based on the WaveNet architecture, taking the output video frame features with temporal information as input and outputting raw audio data;
a training and testing module: connecting the improved convolutional neural network, the recurrent neural network and the audio generation network, training them on the input data set until the network loss converges, then testing with audio and video data as input, obtaining a sequence of audio sample points at the output and converting it into an audible audio file.
The invention has the beneficial effects that:
1) The method takes a sequence of video frames as input and raw audio sample points as output, realizing automatic soundtrack generation for sports videos. It associates visual information with auditory information and can output a soundtrack for any input frame sequence, filling, to a certain extent, the gap left by current video processing and music generation work in this respect.
2) The invention generates music at the scale of raw audio samples, so users can train and use the system without first learning music theory, which greatly reduces manual workload. Because music is generated directly as raw audio data, the generated music is more diverse and less constrained.
3) The invention enables fast, batch soundtrack generation for sports videos. Compared with traditional manual scoring, it simplifies the production process, reduces its cost and improves its efficiency.
4) The invention constructs a high-quality, well-matched audio-video data set for network training and builds a motion-driven intelligent music generation system that accurately extracts features from input video data and generates music of relatively high quality that matches the sports scene. The subjective MOS of the generated music exceeds 3.5, soundtracks for sports and similar videos can be produced quickly and in batches, and the time and economic cost of music production are cut by more than half.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention;
FIG. 2 is a schematic diagram of dilated causal convolution according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the residual and skip connection structure according to an embodiment of the present invention;
FIG. 4 is a schematic representation of the MobileNetV3 network structure;
FIG. 5 is a schematic diagram of a network architecture according to the present invention;
FIG. 6 is a schematic diagram of the operation flow of the network training of the present invention;
FIG. 7 is a schematic diagram illustrating the operation of the network test according to the present invention;
FIG. 8 is a visualization of the features extracted by the MobileNetV3 network of the present invention;
FIG. 9 is a graph illustrating a loss curve of network training according to the present invention;
FIG. 10 is a schematic diagram of the comparison between the generated audio and the original audio waveform according to the present invention;
FIG. 11 is a diagram illustrating a comparison between the generated audio and the original audio spectrum according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-6, the present invention provides a technical solution: a motion-driven intelligent music generation method, whose flow is shown in FIG. 1, comprising the following steps:
step 1: a data set is constructed. Firstly, downloading a large amount of sports videos with the score (mixed scissors, collection and the like with the score) on a video website; secondly, a python program is written to automatically cut the downloaded video with equal length (the duration of each video segment is 4 seconds); thirdly, artificially selecting high-quality video clips which have obvious change of the dubbing music rhythm and no noise; and finally, compiling codes to perform operations such as audio-video separation, audio-video format conversion and the like on the audio-video files, naming the separated audio-video files according to a specific rule, and constructing an audio-video data set. Finally, 4000 pairs of high-quality, strong-diversity and high-matching-degree audio and video data sets are constructed, wherein the training sets 3123 are paired, and the testing sets 877 are paired.
Python's greatest advantage is its support for many neural network frameworks, such as PyTorch and TensorFlow, which let developers design, build, train and test networks conveniently without spending excessive time on low-level computation and design. Python is therefore the most suitable programming language for building this method and system. As the neural network framework, PyTorch is chosen for its simplicity, efficiency, speed, active development and ease of learning.
Step 2: extract initial features of the video frame images. The video is loaded and each frame is resized to 224 × 224 × 3 so that a convolutional neural network can conveniently extract features from it. For the network structure, the MobileNetV3 convolutional neural network was compared systematically with classical networks such as ResNet-50 and NasNet in terms of image classification accuracy, number of multiply-add operations and number of model parameters. The results show that MobileNetV3 maintains good classification accuracy at comparatively low computational cost, and that the trained model is small with few parameters. MobileNetV3 therefore balances feature extraction accuracy against computational efficiency and is well suited to frame feature extraction in this system. The MobileNetV3 structure is adopted and modified so that the network is used for image feature extraction rather than image classification, which reduces computation while still extracting image features accurately. As shown in the MobileNetV3 structure table of FIG. 4, the network takes a 224 × 224 × 3 image as input and outputs a feature vector of dimension 1 × 576.
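A minimal PyTorch sketch of this step, assuming the torchvision MobileNetV3-Small backbone (whose final feature map has 576 channels, matching the 1 × 576 feature above); the exact modification made in the patent is not spelled out, so dropping the classifier head is an assumption.

    # Sketch of the frame feature extractor: MobileNetV3 with the classification
    # head removed, so it outputs one 1 x 576 feature vector per frame.
    import torch
    from torchvision.models import mobilenet_v3_small

    class FrameEncoder(torch.nn.Module):
        def __init__(self):
            super().__init__()
            net = mobilenet_v3_small()     # backbone; pretrained weights optional
            self.features = net.features   # convolutional trunk (576-channel output)
            self.pool = net.avgpool        # global average pooling

        def forward(self, frames):         # frames: (N, 3, 224, 224)
            x = self.features(frames)
            x = self.pool(x)               # (N, 576, 1, 1)
            return torch.flatten(x, 1)     # (N, 576)
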
Step 3: temporally associate the video frame features. After the initial features of each frame have been extracted, they are associated over time and video frame features carrying temporal information are output. At this step the network is built from GRU units, which associate the features in sequence. The GRU (gated recurrent unit) is a variant of the recurrent neural network (RNN); its gating units largely resolve the vanishing- and exploding-gradient problems of the classical RNN while keeping the computational cost low. GRU units are therefore chosen to form the recurrent network that temporally associates the video frame features.
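A minimal sketch of the temporal association network, assuming a single-layer torch.nn.GRU whose hidden size equals the 4410-dimensional per-frame output stated later in the parameter list.

    # Sketch of the temporal association step: a GRU maps the per-frame 576-d
    # features to 4410-d features carrying temporal context.
    import torch

    class TemporalAssociator(torch.nn.Module):
        def __init__(self, in_dim=576, hidden_dim=4410):
            super().__init__()
            self.gru = torch.nn.GRU(input_size=in_dim, hidden_size=hidden_dim, batch_first=True)

        def forward(self, frame_feats):   # frame_feats: (N, 576) for one video
            x = frame_feats.unsqueeze(0)  # (1, N, 576): one video = one sequence
            out, _ = self.gru(x)          # (1, N, 4410)
            return out.squeeze(0)         # (N, 4410)
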
Step 4: generate raw audio data. The audio generation network is an autoregressive network built on the ideas of dilated causal convolution, residual blocks and skip connections from WaveNet, and models the temporal relationship between input and output. A stack of dilated causal convolutions maps the input data to audio data (FIG. 2 illustrates dilated causal convolution), while residual and skip connections let the gradient propagate through the deep model and accelerate training convergence (FIG. 3 illustrates the residual and skip connection structure). The input of this network is the output of the frame-feature temporal association network of the previous step, and its output is raw audio data.
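A minimal sketch of one WaveNet-style layer as described above; the gated activation, the channel widths and the dilation pattern are assumptions, since the patent names only dilated causal convolution plus residual and skip connections.

    # Sketch of one dilated-causal-convolution layer with residual and skip
    # connections. Gating and channel sizes are assumptions.
    import torch
    import torch.nn.functional as F

    class DilatedCausalBlock(torch.nn.Module):
        def __init__(self, channels, skip_channels, dilation):
            super().__init__()
            self.dilation = dilation
            self.conv = torch.nn.Conv1d(channels, 2 * channels, kernel_size=2, dilation=dilation)
            self.res = torch.nn.Conv1d(channels, channels, kernel_size=1)
            self.skip = torch.nn.Conv1d(channels, skip_channels, kernel_size=1)

        def forward(self, x):                              # x: (batch, channels, time)
            h = self.conv(F.pad(x, (self.dilation, 0)))    # left-pad only, so the convolution is causal
            filt, gate = h.chunk(2, dim=1)
            h = torch.tanh(filt) * torch.sigmoid(gate)     # gated activation
            return x + self.res(h), self.skip(h)           # residual output, skip output
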
Step 5: network training and testing. The networks designed in steps 2, 3 and 4 (the convolutional neural network, the recurrent neural network and the audio generation network) are connected and trained on the data set built in step 1; the parameters are updated continuously during training so that they approach an optimum, and training ends once the network loss converges, yielding the model. For testing, audio and video data are input into the model and a sequence of raw audio sample points is obtained at the output. Finally, code converts the generated sample-point sequence into an audible audio file, completing the music generation; the network structure is shown in FIG. 5.
By associating visual information with auditory information, the invention can output a soundtrack for any input frame sequence, filling to some extent the gap left by current video processing and music generation work in this respect. Because music is generated at the scale of raw audio samples, users can train and use the system without first learning music theory, which greatly reduces manual workload, and the generated music is more diverse and less constrained. The method enables fast, batch soundtrack generation for sports videos; compared with traditional manual scoring it simplifies the production process, reduces its cost and improves its efficiency.
The invention also provides a further technical solution: a motion-driven intelligent music generation system comprising the following modules:
a data set construction module: cropping downloaded sports videos with soundtracks into equal-length segments, selecting video segments and separating their audio and video to obtain an audio-video data set;
a video frame initial feature extraction module: fixing the size of each frame image in the video and extracting features from each frame with the improved convolutional neural network;
a video frame feature temporal association module: forming a recurrent neural network from GRU units, temporally associating the initial frame features and outputting video frame features carrying temporal information;
an audio data generation module: constructing an audio generation network based on the WaveNet architecture, taking the output video frame features with temporal information as input and outputting raw audio data;
a training and testing module: connecting the improved convolutional neural network, the recurrent neural network and the audio generation network, training them on the input data set until the network loss converges, then testing with audio and video data as input, obtaining a sequence of audio sample points at the output and converting it into an audible audio file.
Video reading is done with the OpenCV package in Python. First, tools in OpenCV read the video's frame rate, frame count and per-frame image size (the videos processed by the system have a frame rate of 30 frames per second). Second, each frame is extracted and resized to 224 × 224 × 3 to speed up the subsequent convolutional neural network. Finally, z-score normalization is applied to each frame: the mean intensity of all pixels of the image is subtracted from each pixel's intensity and the result is divided by the standard deviation of all pixel intensities, so that the data follow a normal distribution with mean 0 and standard deviation 1, which helps the neural network converge faster during iterative training. The z-score normalization formula is:
x* = (x − x̄) / σ
where x* denotes the normalized intensity value of the pixel, x denotes the original intensity value of the pixel, x̄ denotes the mean of all pixel intensity values of the image, and σ denotes the standard deviation of all pixel intensity values of the image.
The normalized frame images are arranged, in their original order, into a matrix of dimension N × 224 × 224 × 3 (N denotes the number of frames in the video, likewise below), which completes the reading and preprocessing of the video.
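A minimal sketch of this reading and preprocessing pipeline with OpenCV and NumPy; the per-frame normalization follows the formula above, and leaving the BGR channel order returned by OpenCV unchanged is an assumption.

    # Sketch of video reading and preprocessing: resize each frame to
    # 224 x 224 x 3 and apply per-frame z-score normalization.
    import cv2
    import numpy as np

    def load_video(path, size=224):
        cap = cv2.VideoCapture(path)
        frames = []
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            frame = cv2.resize(frame, (size, size)).astype(np.float32)
            frame = (frame - frame.mean()) / (frame.std() + 1e-8)  # z-score per frame
            frames.append(frame)
        cap.release()
        return np.stack(frames)   # (N, 224, 224, 3)
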
Audio reading uses the wave library in Python, which reads and writes WAV files quickly. During reading, the raw audio information — number of channels, quantization bits, sampling frequency and number of sample points — is obtained first (every audio file in the constructed data set has 2 channels, 16-bit quantization and a 44100 Hz sampling frequency); the amplitude at every sample point of the raw audio sequence is then read and stored as integers in a one-dimensional vector, which completes the reading of the audio file.
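A minimal sketch of audio reading with the standard-library wave module and NumPy, assuming the 16-bit PCM WAV files described above.

    # Sketch of audio reading: load a 16-bit stereo WAV file into a 1-D
    # integer vector of interleaved sample amplitudes.
    import wave
    import numpy as np

    def load_wav(path):
        with wave.open(path, "rb") as wf:
            n_channels = wf.getnchannels()   # 2 in the constructed data set
            samp_width = wf.getsampwidth()   # 2 bytes = 16-bit quantization
            rate = wf.getframerate()         # 44100 Hz
            raw = wf.readframes(wf.getnframes())
        samples = np.frombuffer(raw, dtype=np.int16)   # interleaved channel amplitudes
        return samples, n_channels, samp_width, rate
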
Audio preprocessing consists of μ-law companding quantization and channel separation of the raw audio data. The first step is μ-law companding: the amplitude values in the original 16-bit audio range from −32768 to +32767 and are stored as integers. All sample amplitudes are first divided by 32768, normalizing them to between −1 and 1, and are then transformed nonlinearly according to the following formula:
f(x_t) = sign(x_t) · ln(1 + μ|x_t|) / ln(1 + μ)
where f(x_t) denotes the amplitude of the audio sample point after μ-law encoding, sign denotes the sign function, μ is the preset μ-law constant, and x_t denotes the audio sample amplitude normalized to between −1 and 1.
The result of the nonlinear transformation is quantized uniformly into 256 possible values, giving amplitudes ranging from 0 to 255. In this way the original 16-bit sample amplitudes are converted into 8-bit amplitudes after μ-law companding. After the companding quantization, all sample amplitudes are divided by 255 so that they are normalized to between 0 and 1, which benefits the subsequent computation of the network loss. The second step is channel separation: in a two-channel audio file the sample points of the two channels are interleaved, i.e., even-indexed sample points belong to channel 1 and odd-indexed sample points belong to channel 2. The system separates the two channels and keeps only the channel-1 samples for subsequent network training.
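A minimal NumPy sketch of this preprocessing, following the companding formula and the 256-level quantization above; the exact rounding scheme is an assumption.

    # Sketch of audio preprocessing: mu-law companding, 256-level quantization,
    # normalization to [0, 1], and channel separation (keep channel 1 only).
    import numpy as np

    MU = 255  # preset mu constant, matching the international standard used by the system

    def preprocess_audio(samples):                 # samples: interleaved int16 stereo
        x = samples.astype(np.float32) / 32768.0   # normalize amplitudes to [-1, 1]
        f = np.sign(x) * np.log1p(MU * np.abs(x)) / np.log1p(MU)   # mu-law companding
        q = np.round((f + 1.0) / 2.0 * 255.0)      # uniform quantization to 256 values (0..255)
        q = q / 255.0                              # normalize to [0, 1] for the loss
        return q[0::2]                             # keep channel 1 (even-indexed samples)
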
First, the z-score-normalized image sequence of size N × 224 × 224 × 3 is fed, frame by frame, into the convolutional neural network; after the network's convolution, pooling and other operations, a series of one-dimensional vectors is obtained at the output, one per input frame, each of dimension 1 × 576, so the features of all frames of a video have dimension N × 576. Second, these vectors are fed in order into the network formed by GRU units, yielding a corresponding series of one-dimensional vectors with temporal association; in this system the GRU output dimension is set to 1 × 4410, so the feature of the whole video has dimension N × 4410. These N × 4410 features are then fed into the audio generation network, whose output has dimension N × 1470: N corresponds to the frames of the original video, and 1470 means that the audio generation network maps the temporally associated feature of one frame to 1470 audio sample points. A sigmoid is applied to the output of the audio generation network so that all output sample amplitudes lie between 0 and 1, matching the normalization of the original audio described above. The N × 1470 output matrix is flattened into a one-dimensional vector, the joint loss between this vector and the preprocessed original audio is computed, and the network parameters are finally updated by back-propagation.
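A minimal sketch of how the three sub-networks are chained during this forward pass, assuming the FrameEncoder and TemporalAssociator sketched earlier and a hypothetical AudioGenerator built from the dilated causal blocks; the shapes follow the dimensions in the paragraph above.

    # Sketch of the end-to-end forward pass: N frames -> (N, 576) -> (N, 4410)
    # -> (N, 1470) audio sample points -> flattened prediction in [0, 1].
    import torch

    class MusicGenerator(torch.nn.Module):
        def __init__(self, frame_encoder, temporal_associator, audio_generator):
            super().__init__()
            self.encoder = frame_encoder        # (N, 3, 224, 224) -> (N, 576)
            self.temporal = temporal_associator # (N, 576) -> (N, 4410)
            self.audio = audio_generator        # (N, 4410) -> (N, 1470), hypothetical wrapper

        def forward(self, frames):
            feats = self.encoder(frames)
            feats = self.temporal(feats)
            samples = torch.sigmoid(self.audio(feats))  # amplitudes in [0, 1]
            return samples.flatten()                    # N * 1470 sample points
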
The wave library in Python is also used to generate audio files. After the network output is obtained, the amplitudes of the output one-dimensional audio samples are first multiplied by 255 to undo the normalization; the audio data are then μ-law decoded back into 16-bit amplitude values; and finally a WAV file is generated from the audio sequence with a sampling rate of 44100 Hz, 1 channel and 16-bit quantization.
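A minimal sketch of this post-processing, assuming the inverse of the μ-law companding described above and the standard-library wave module for writing; the intermediate mapping back to [−1, 1] before expansion is an assumption.

    # Sketch of audio file generation: undo normalization, mu-law decode back
    # to 16-bit amplitudes, and write a mono 44100 Hz WAV file.
    import wave
    import numpy as np

    MU = 255

    def write_wav(pred, path):                                 # pred: network output in [0, 1]
        q = np.asarray(pred, dtype=np.float64) * 255.0         # back to the 0..255 quantized scale
        f = q / 255.0 * 2.0 - 1.0                              # back to [-1, 1] before decoding
        x = np.sign(f) * ((1.0 + MU) ** np.abs(f) - 1.0) / MU  # inverse mu-law expansion
        pcm = np.clip(x * 32768.0, -32768, 32767).astype(np.int16)
        with wave.open(path, "wb") as wf:
            wf.setnchannels(1)        # mono
            wf.setsampwidth(2)        # 16-bit quantization
            wf.setframerate(44100)    # 44100 Hz
            wf.writeframes(pcm.tobytes())
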
When the system is in training mode, the pre-written data-set iteration code first obtains, in turn, the file paths of the audio-video pairs in the training set; the corresponding audio and video data are then read and preprocessed; the video data are fed into the deep neural network module for forward propagation to obtain the network output; finally, the loss between the network output and the preprocessed audio is computed, the parameters are updated by back-propagation, and after several epochs the network loss converges, at which point the network model is obtained and saved. The training workflow is shown in FIG. 6. When the system is in test mode, the data-set iteration code obtains the file paths of the videos in the test set in turn; the corresponding video data are read and preprocessed; the trained network model is loaded and the video data are fed into it; finally, the network output is passed to the audio file generation module and the generated audio file is saved. The test workflow is shown in FIG. 7. By separating the training mode and the test mode, the system keeps the two workflows cleanly apart and makes the originally complex operation much simpler.
The hyperparameters that must be set manually in the system are: the neural network loss functions, the loss-balancing constant λ, the optimizer, the initial learning rate, the learning-rate decay scheme, the batch size, the number of epochs and the number of retained audio channels. In this system the loss functions are the L2 loss and the cosine loss, the loss-balancing constant λ is 1.5, the optimizer is Adam, the initial learning rate is 0.0001, the learning rate is multiplied by 0.5 every 100 epochs, the batch size is 1, the number of epochs is 200, and the number of retained audio channels is 1. Practice shows that the loss-balancing constant must be tuned so that the two loss terms are numerically similar and therefore influence training to roughly the same degree. Increasing the batch size appropriately makes convergence more stable, and setting the number of retained audio channels to 2 gives the generated music more stereo presence and diversity and improves its quality, at the price of higher computational cost; when actually training the network, λ should therefore be tuned, and, if computation permits, the batch size can be increased and the number of audio channels set to 2. In addition, because the Adam optimizer is sensitive to the initial learning rate (a poor setting can prevent convergence), adaptive-learning-rate schemes such as Adagrad or Adadelta can also be chosen in practice.
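A minimal PyTorch sketch of these training hyperparameters (Adam, initial learning rate 0.0001, halved every 100 epochs, 200 epochs, batch size 1, λ = 1.5); `model` stands for the combined network sketched earlier.

    # Sketch of the optimizer and learning-rate schedule listed above.
    import torch

    def make_optimizer(model):
        optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial learning rate 0.0001
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)  # x0.5 every 100 epochs
        return optimizer, scheduler

    NUM_EPOCHS, BATCH_SIZE, LAMBDA = 200, 1, 1.5
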
In the video reading and preprocessing module, the parameters that must be set manually are the width and height of each preprocessed frame; in this system both are set to 224 pixels.
In the audio reading module, no parameters need to be set manually.
In the audio preprocessing module, the parameters that must be set manually are the μ constant of the μ-law companding and the number of quantized amplitude levels. The system adopts the same settings as the international standard, namely 256 quantized amplitude levels and μ = 255.
In the deep neural network module, the parameters that must be set manually are: the output feature dimension of the GRU unit, the input and output feature dimensions of the audio generation network, the number of convolutional layers in the audio generation network, the number of times those layers are cycled, and the residual block size of the audio generation network. In this system the GRU output dimension is 4410, the audio generation network's input dimension is 4410 and output dimension is 1470, the audio generation network has 4 convolutional layers cycled 2 times, and its residual block size is 1024. Practice shows that increasing the residual block size markedly improves the network's learning speed and fitting ability, so the network converges faster; increasing the number of convolutional layers enlarges the network's receptive field over the input, making the output audio more coherent; and increasing the GRU output dimension and the audio generation network's input dimension lets the network capture more feature information, improving performance. Where computation permits, these parameters can be raised to improve the system.
In the audio file generation module, the parameters that must be set manually are the number of channels and the sampling rate of the generated audio file; in this system they are set to 1 channel and 44100 Hz.
The computation performed by a neural network is complex, and a computational graph presents that complexity in visual form so that it becomes intuitive and easy to study. A computational graph therefore helps in understanding the network's computation; in this system the TensorBoard tool is used to draw the computational graph of the deep neural network and to visualize the image features extracted by the MobileNetV3 convolutional neural network. First the system is switched to test mode and audio-video data are input; the vectors output by the MobileNetV3 network are then reshaped so that the extracted feature vectors can be observed more easily; the video frames and the MobileNetV3 outputs are saved as images; and finally the frame sequence is compared with the extracted features. The result is shown in FIG. 8. The feature vector extracted by MobileNetV3 is simply the data output for each video frame after it passes through the convolutional network; although such data are abstract and hard to interpret directly through visualization, FIG. 8 shows that similar frames yield similar features while very different frames yield very different features, confirming experimentally that the MobileNetV3 network successfully extracts the features of the different images in the video.
In machine learning, the loss function computes, at each iteration, the difference between the forward-propagated network output and the ground truth, determines the direction of the parameter gradients for the next iteration, and thereby drives parameter updates by gradient descent. The difference computed by the loss function between the output value and the true value is the loss value. Choosing the loss function is crucial during training: it directly determines whether training succeeds. With the continuous development of machine learning, many loss functions have appeared for different tasks, for example the L1, L2 and Smooth-L1 losses for distance measurement; the cosine loss for measuring the similarity of two vectors; and the cross-entropy and negative log-likelihood losses for classification.
Because the neural network must learn both the phase information and the amplitude information of the audio data, the cosine loss and the L2 loss are used jointly to train the network, so that the amplitude and phase of the music generated by the system are closer to those of the original music. In addition, a loss-balancing constant λ couples the cosine loss and the L2 loss so that the two losses influence training equally and training is not biased toward whichever loss would otherwise dominate. The two loss functions are combined as follows:
Loss = CosLoss + λ × L2Loss
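A minimal PyTorch sketch of this joint loss; treating the cosine loss as 1 minus the cosine similarity of the two flattened audio vectors is an assumption, since the patent does not spell out its exact form.

    # Sketch of the joint loss Loss = CosLoss + lambda * L2Loss (lambda = 1.5).
    import torch
    import torch.nn.functional as F

    def joint_loss(pred, target, lambda_=1.5):   # pred, target: flattened audio in [0, 1]
        l2 = F.mse_loss(pred, target)            # amplitude term
        cos = 1.0 - F.cosine_similarity(pred.unsqueeze(0), target.unsqueeze(0)).mean()  # shape/phase term
        return cos + lambda_ * l2
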
the training loss curve is drawn by using a Tensoboard tool, the visualization condition of the training loss is shown in FIG. 9, and it can be seen from the loss curve graph that the cosine loss value, the L2 loss value and the total loss value of the neural network are all continuously reduced to be finally converged along with the iteration of data, which shows that the parameters in the deep neural network in the system are effectively updated, and the network is effectively trained.
MATLAB is used to plot waveforms and spectra of the system-generated audio and the original audio and to perform time-frequency analysis on them. Visualizing the generated and original audio in the time and frequency domains makes it possible to compare their differences directly and to analyze the shortcomings of the generated audio. During training, μ-law companding reduces the original audio used for training to 256 amplitude levels, and the sample amplitudes generated by the network are likewise quantized to 256 levels. The time-domain and frequency-domain comparisons between the μ-law-companded network output and the original audio are shown in FIGS. 10 and 11; in each comparison the upper waveform is the original audio and the lower waveform is the system-generated audio. The comparison shows that, although the waveform and spectrum of the generated audio differ from the original in many details, their amplitudes, distributions and other characteristics are essentially the same. The neural network has therefore successfully learned the basic amplitude and distribution information of the original audio in both the time and frequency domains, and can generate audio sample points whose amplitude and distribution resemble those of the original audio for a given input video.
By building a high-quality audio-video data set for network training and a motion-driven intelligent music generation system, the invention accurately extracts features from input video data and generates relatively high-quality music that matches the sports scene; the subjective MOS of the generated music exceeds 3.5, soundtracks for sports and similar videos can be produced quickly and in batches, and the time and economic cost of music production are cut by more than half.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and/or modifications of the invention can be made, and equivalents and modifications of some features of the invention can be made without departing from the spirit and scope of the invention.

Claims (7)

1. A motion-driven intelligent music generation method, characterized by comprising the following steps:
S1, constructing a data set: downloading sports videos with soundtracks from video websites, cropping the downloaded videos into equal-length segments with a Python program, selecting suitable video segments, and separating the audio and video of the selected segments to obtain an audio-video data set;
S2, extracting initial features of the video frame images: fixing the size of each frame image in the video and extracting features from each frame with an improved convolutional neural network, which takes the fixed-size image as input and outputs a feature vector of dimension 1 × 576;
S3, temporally associating the video frame features: extracting the initial features of each frame according to step S2, feeding them to a recurrent neural network composed of GRU units that associates the frame features over time, and outputting video frame features carrying temporal information;
S4, generating raw audio data: constructing an audio generation network based on the WaveNet architecture, in which dilated causal convolutions map the input data to audio data and residual and skip connections let the gradient propagate through the deep model and accelerate training convergence; the output of step S3 serves as the input of the audio generation network, which outputs raw audio data;
S5, network training and testing: connecting the improved convolutional neural network, the recurrent neural network and the audio generation network, training them on the data set constructed in step S1 while continuously updating the network parameters, and ending training once the network loss converges to obtain a model;
audio and video data are then input into the model for testing, a sequence of audio sample points is obtained at the output, and the sequence is converted into an audible audio file by program code.
2. The motion-driven intelligent music generation method according to claim 1, wherein: the equal-length cropping in step S1 crops each video segment to a duration of 4 seconds.
3. The motion-driven intelligent music generation method according to claim 1, wherein: the selection of video segments in step S1 selects segments whose soundtrack rhythm changes noticeably and which contain no noise.
4. The motion-driven intelligent music generation method according to claim 1, wherein: fixing the size of each frame image in the video in step S2 means fixing each frame to 224 × 224 × 3.
5. The motion-driven intelligent music generation method according to claim 1, wherein: the improvement of the convolutional neural network in step S2 changes its purpose from image classification to image feature extraction.
6. The motion-driven intelligent music generation method according to claim 1, wherein: the convolutional neural network in step S2 is MobileNetV3.
7. A motion-driven intelligent music generation system, characterized in that the system comprises the following modules:
a data set construction module: cropping downloaded sports videos with soundtracks into equal-length segments, selecting video segments and separating their audio and video to obtain an audio-video data set;
a video frame initial feature extraction module: fixing the size of each frame image in the video and extracting features from each frame with the improved convolutional neural network;
a video frame feature temporal association module: forming a recurrent neural network from GRU units, temporally associating the initial frame features and outputting video frame features carrying temporal information;
an audio data generation module: constructing an audio generation network based on the WaveNet architecture, taking the output video frame features with temporal information as input and outputting raw audio data;
a training and testing module: connecting the improved convolutional neural network, the recurrent neural network and the audio generation network, training them on the input data set until the network loss converges, then testing with audio and video data as input, obtaining a sequence of audio sample points at the output and converting it into an audible audio file.
CN202110541902.6A 2021-05-18 2021-05-18 Intelligent music generation method and system based on improved neural network Active CN113423005B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110541902.6A CN113423005B (en) 2021-05-18 2021-05-18 Intelligent music generation method and system based on improved neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110541902.6A CN113423005B (en) 2021-05-18 2021-05-18 Intelligent music generation method and system based on improved neural network

Publications (2)

Publication Number Publication Date
CN113423005A true CN113423005A (en) 2021-09-21
CN113423005B (en) 2022-05-03

Family

ID=77712519

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110541902.6A Active CN113423005B (en) 2021-05-18 2021-05-18 Intelligent music generation method and system based on improved neural network

Country Status (1)

Country Link
CN (1) CN113423005B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114428761A (en) * 2022-01-18 2022-05-03 成都信息工程大学 Neural network warping method and device based on FPGA
CN114842819A (en) * 2022-05-11 2022-08-02 电子科技大学 Single track MIDI music generation method based on deep reinforcement learning
CN116647612A (en) * 2023-06-12 2023-08-25 南京邮电大学 Method for generating music by shaking mobile phone based on convolutional neural network
WO2023128877A3 (en) * 2021-12-31 2023-10-26 脸萌有限公司 Video generating method and apparatus, electronic device, and readable storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200135171A1 (en) * 2017-02-28 2020-04-30 National Institute Of Information And Communications Technology Training Apparatus, Speech Synthesis System, and Speech Synthesis Method
CN111386568A (en) * 2017-10-27 2020-07-07 弗劳恩霍夫应用研究促进协会 Apparatus, method or computer program for generating a bandwidth enhanced audio signal using a neural network processor
US20200342848A1 (en) * 2018-01-11 2020-10-29 Yamaha Corporation Voice synthesis method, voice synthesis apparatus, and recording medium
CN111602194A (en) * 2018-09-30 2020-08-28 微软技术许可有限责任公司 Speech waveform generation
CN111737516A (en) * 2019-12-23 2020-10-02 北京沃东天骏信息技术有限公司 Interactive music generation method and device, intelligent sound box and storage medium
CN111739545A (en) * 2020-06-24 2020-10-02 腾讯音乐娱乐科技(深圳)有限公司 Audio processing method, device and storage medium
CN111782576A (en) * 2020-07-07 2020-10-16 北京字节跳动网络技术有限公司 Background music generation method and device, readable medium and electronic equipment
CN112712813A (en) * 2021-03-26 2021-04-27 北京达佳互联信息技术有限公司 Voice processing method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
AARON VAN DEN OORD et al.: "WaveNet: A Generative Model for Raw Audio", arXiv *

Also Published As

Publication number Publication date
CN113423005B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN113423005B (en) Intelligent music generation method and system based on improved neural network
CN111930992B (en) Neural network training method and device and electronic equipment
CN115641543B (en) Multi-modal depression emotion recognition method and device
CN110246488B (en) Voice conversion method and device of semi-optimized cycleGAN model
CN111400461B (en) Intelligent customer service problem matching method and device
CN112633175A (en) Single note real-time recognition algorithm based on multi-scale convolution neural network under complex environment
CN115662435B (en) Virtual teacher simulation voice generation method and terminal
CN112184859B (en) End-to-end virtual object animation generation method and device, storage medium and terminal
CN118194923B (en) Method, device, equipment and computer readable medium for constructing large language model
CN115881156A (en) Multi-scale-based multi-modal time domain voice separation method
CN113780811B (en) Musical instrument performance evaluation method, device, equipment and storage medium
Akinpelu et al. Lightweight deep learning framework for speech emotion recognition
CN111488486B (en) Electronic music classification method and system based on multi-sound-source separation
Wang et al. Design of smart home system speech emotion recognition model based on ensemble deep learning and feature fusion
CN113806584B (en) Self-supervision cross-modal perception loss-based method for generating command actions of band
CN116524074A (en) Method, device, equipment and storage medium for generating digital human gestures
Jothimani et al. A new spatio-temporal neural architecture with Bi-LSTM for multimodal emotion recognition
CN113246156A (en) Child accompanying robot based on intelligent emotion recognition and control method
Li et al. Application of virtual human sign language translation based on speech recognition
Yu et al. Synthesizing 3D Acoustic-Articulatory Mapping Trajectories: Predicting Articulatory Movements by Long-Term Recurrent Convolutional Neural Network
Cao et al. Pyramid Memory Block and Timestep Attention for Speech Emotion Recognition.
Abdelnour et al. From visual to acoustic question answering
Velican et al. Automatic Recognition of Improperly Pronounced Initial ‘r’ Consonant in Romanian
Zhang et al. Cascade Temporal Convolutional Network for Multitask Learning
Pascual De La Puente Efficient, end-to-end and self-supervised methods for speech processing and generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant