CN111739492B - Music melody generation method based on pitch contour curve - Google Patents
- Publication number
- CN111739492B (application CN202010559217.1A)
- Authority
- CN
- China
- Prior art keywords
- melody
- length
- long-term structure
- network
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H1/00—Details of electrophonic musical instruments
- G10H1/0008—Associated control or indicating means
- G10H1/0025—Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/27—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
- G10L25/30—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/101—Music Composition or musical creation; Tools or processes therefor
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10H—ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
- G10H2210/00—Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
- G10H2210/101—Music Composition or musical creation; Tools or processes therefor
- G10H2210/111—Automatic composing, i.e. using predefined musical rules
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D30/00—Reducing energy consumption in communication networks
- Y02D30/70—Reducing energy consumption in communication networks in wireless communication networks
Abstract
The invention relates to the technical field of music generation, and in particular to a music melody generation method based on a pitch contour curve, comprising the following steps: step one, extracting the long-term structure information of the pitch contour in the frequency domain, where the long-term structure information comprises the low-frequency part of the pitch contour's frequency-domain sequence and reflects the long-term trend of the melody; step two, fitting the long-term structure information with a label-controlled neural network, so that long-term structure information corresponding to a given label can be generated; step three, training another neural network with the long-term structure information and the melody length information of the music data, so that the network can predict melody length information from long-term structure information. By exploiting the frequency-domain characteristics of the pitch contour curve, the invention generates music melodies with a controllable long-term structure whose distribution is closer to that of real music than that of melodies generated by a long short-term memory network.
Description
Technical Field
The invention relates to the technical field of music generation, in particular to a music melody generation method based on a pitch contour curve.
Background
Music generation has long been an active direction of exploration in the field of computer art. In the early days of computing, people began to use traditional algorithms to generate music. In recent years, deep neural networks have increasingly been applied to music generation, including long short-term memory (LSTM) networks, generative adversarial networks, convolutional neural networks, and improved variational autoencoders. The short pieces generated by these networks perform quite well; the generation of long pieces, however, remains under-studied. There is currently no good solution for giving the melody of a generated long piece reasonable phrase arrangement, a satisfying progression, and smooth transitions between sections. In view of this, we propose a music melody generation method based on the pitch contour.
Disclosure of Invention
To remedy the above shortcomings, the invention provides a music melody generation method based on a pitch contour.
The technical scheme of the invention is as follows:
A music melody generation method based on a pitch contour curve comprises the following steps:
step one, extracting the long-term structure information of the pitch contour in the frequency domain, where the long-term structure information comprises the low-frequency part of the pitch contour's frequency-domain sequence and reflects the long-term trend of the melody;
step two, fitting the long-term structure information with a label-controlled neural network, so that long-term structure information corresponding to a given label can be generated;
step three, training another neural network with the long-term structure information and the melody length information of the music data, so that the network can predict melody length information from long-term structure information;
step four, determining the length of the target melody with the trained neural network, and expanding the long-term structure in the frequency domain to obtain a rough melody curve;
step five, performing segment-by-segment vocabulary matching and replacement on the rough melody curve with a vocabulary collected from the music dataset, finally obtaining music with optimized details.
As a preferred technical scheme of the invention, the specific steps of the long-term structure fitting network in step two are as follows:
first, a suitable length is determined to compress the long-term structure; after reasonable selection, the compressed long-term structures are unified to a length of 300 bits;
then, the mean pitch of every melody is shifted to C3, i.e. 60; after the DC component of the frequency-domain sequence is deleted, the sequence carries only the long-term characteristics of the melody, decoupling it from the specific melody;
next, the real and imaginary parts of the frequency-domain sequence are separated and recombined into a sequence of length 600;
finally, label information describing the rise and fall of the melody's long-term structure is constructed and fed into the fitting network together with the corresponding long-term structure.
As a preferred technical scheme of the invention, an embedding-layer network is used within the long-term structure fitting network of step two to control the trend of the generated long-term structure.
As a preferred technical scheme of the invention, the specific steps of the melody length determination network in step four are as follows:
first, a melody frequency-domain sequence of a musical piece is generated with a long short-term memory network;
then, a module that assists in memorizing the low-frequency part is designed as the stopping mark of the long short-term memory network and can serve as a reference mark for the other frequency bands; on this basis, the auxiliary low-frequency memory module can be separated out as an independent network module, and this network can estimate the probable length of the melody from the low-frequency part of the frequency-domain sequence;
then, a range is determined for the melody lengths of the music used to train the network, and the output of the neural network is normalized with this range;
finally, this length range is mapped uniformly onto the (-1, 1) output range of the tanh activation function.
As a preferred technical scheme of the invention, the data format used to train the long short-term memory network is a pitch contour curve with a time step of one sixteenth note and pitches encoded with C3 mapped to 60; the long short-term memory network uses RMSProp as its optimizer, and the length of the generated melody is 500.
As a preferred technical scheme of the invention, the specific steps of vocabulary matching in step five are as follows:
first, the keys of all melodies in the music library are determined and uniformly transposed to C major;
then, the melodies are cut into segments of the vocabulary length to obtain a corpus;
finally, the corpus is matched piecewise against the rough melody generated by the neural network, the matching criterion being minimization of the mean squared error.
A preferred technical scheme of the invention comprises the following parameter settings:
the label length is set to 10;
the noise input length is 100;
the length of the output frequency-domain information is 600;
the frequency-domain intensity scaling factor is set to 0.2;
the long-term structure fitting network uses an Adam optimizer for parameter optimization, with the learning rate set to 1×10⁻⁴.
A preferred technical scheme of the invention further comprises the following parameter settings:
the length determination network performs parameter optimization with an Adam optimizer whose learning rate is set to 1×10⁻⁴;
the vocabulary matching length is set to 8, and classification labels are used for fast retrieval;
the melody length of the music is set to between 300 and 3000 bits, corresponding to melody durations of 40 seconds to 7 minutes.
Compared with the prior art, the invention has the beneficial effects that:
the invention generates the music melody with controllable long-term structure by utilizing the frequency domain characteristic of the pitch contour curve, and can realize the music distribution which is closer to and more realistic than the music generated by a long-short-time network.
Drawings
FIG. 1 is a basic framework diagram of the operational flow of the present invention;
FIG. 2 is a schematic diagram of a long-term structure-fitting network according to the present invention;
FIG. 3 is a schematic diagram of a length-determining network according to the present invention;
FIG. 4 is a diagram illustrating a vocabulary matching step according to the present invention;
FIG. 5 is a diagram of the long short-term memory network used in the comparative experiments of the present invention;
FIG. 6 is a schematic diagram of a method for calculating a rhythm transfer matrix according to the present invention;
FIG. 7 is a diagram showing the long-term structure of a music melody and the corresponding label according to the present invention.
Detailed Description
The following description of the embodiments of the present invention is made clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention.
In the description of the present invention, it should be understood that the terms "center", "longitudinal", "lateral", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings are merely for convenience in describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention.
A music melody generation method based on a pitch contour curve comprises the following steps:
step one, extracting the long-term structure information of the pitch contour in the frequency domain, where the long-term structure information comprises the low-frequency part of the pitch contour's frequency-domain sequence and reflects the long-term trend of the melody;
step two, fitting the long-term structure information with a label-controlled neural network, so that long-term structure information corresponding to a given label can be generated;
step three, training another neural network with the long-term structure information and the melody length information of the music data, so that the network can predict melody length information from long-term structure information;
step four, determining the length of the target melody with the trained neural network, and expanding the long-term structure in the frequency domain to obtain a rough melody curve;
step five, performing segment-by-segment vocabulary matching and replacement on the rough melody curve with a vocabulary collected from the music dataset, finally obtaining music with optimized details.
In a specific operation process, as shown in fig. 1, a dataset of music data is first obtained; the music in the dataset is processed to obtain the compressed long-term structures, the long-term structure labels, and the set of music lengths. The long-term structure fitting network is trained with the long-term structures and their labels; the length determination network is trained with the long-term structures and the set of music lengths; and a basic melody vocabulary is collected from the dataset.
In a specific operation process, as shown in fig. 2, the specific steps of the long-term structure fitting network in step two are as follows:
first, a suitable length is determined to compress the long-term structure; after reasonable selection, the compressed long-term structures are unified to a length of 300 bits;
then, the mean pitch of every melody is shifted to C3, i.e. 60; after the DC component of the frequency-domain sequence is deleted, the sequence carries only the long-term characteristics of the melody, decoupling it from the specific melody;
next, the real and imaginary parts of the frequency-domain sequence are separated and recombined into a sequence of length 600;
finally, label information describing the rise and fall of the melody's long-term structure is constructed and fed into the fitting network together with the corresponding long-term structure.
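As an illustrative sketch of the extraction just described (the function name, the use of a real-valued FFT, and the zero-padding of short melodies are our assumptions, not details given by the patent), compression to 300 complex bins, DC removal, and real/imaginary recombination into a length-600 vector might look like:

```python
import numpy as np

def extract_long_term_structure(pitch_contour, keep_bins=300):
    """Keep the lowest `keep_bins` complex FFT bins of a pitch contour
    (DC removed) and split real/imaginary parts into one real-valued
    vector of length 2 * keep_bins."""
    contour = np.asarray(pitch_contour, dtype=float)
    # Shift the mean pitch to C3 = 60 as in the text; only the DC bin is
    # affected by this shift, and that bin is deleted below anyway.
    centred = contour - contour.mean() + 60.0
    spectrum = np.fft.rfft(centred)
    spectrum[0] = 0.0                      # delete the DC component
    low = spectrum[:keep_bins]             # low-frequency part only
    if low.size < keep_bins:               # pad very short melodies
        low = np.pad(low, (0, keep_bins - low.size))
    # separate real and imaginary axes, recombine into a length-600 vector
    return np.concatenate([low.real, low.imag])

vec = extract_long_term_structure(60 + 5 * np.sin(np.linspace(0, 20, 1500)))
print(vec.shape)  # (600,)
```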
It should be noted that, in the last step above, the melody is uniformly divided into ten regions; the mean pitch of each region is compared with the mean pitch of the full melody, regions above the mean being marked 1 and regions below it marked 0, which finally yields the 10-bit label.
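The 10-bit label computation can be sketched directly (the function name and the tie-handling at exactly the mean are our assumptions):

```python
import numpy as np

def structure_label(pitch_contour, regions=10):
    """Split the melody into `regions` equal parts and mark each part 1
    if its mean pitch exceeds the full-melody mean, else 0."""
    contour = np.asarray(pitch_contour, dtype=float)
    overall_mean = contour.mean()
    chunks = np.array_split(contour, regions)
    return [1 if chunk.mean() > overall_mean else 0 for chunk in chunks]

# low first half, high second half -> label [0]*5 + [1]*5
label = structure_label([60] * 50 + [72] * 50)
print(label)  # [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
```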
It should be noted that, as shown in fig. 2, in step two the long-term structure is fitted with fully connected layers; before this, a noise sequence of length 600 is input, and an embedding-layer network is used to control the trend of the generated long-term structure. The embedding layer is a special neural network layer whose connection weights are updated automatically from back-propagated gradient information. It encodes the input label information and maps it into a high-dimensional space, so that the other parts of the network can better interpret and act on the information the label carries.
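As a rough, framework-free stand-in for the role the embedding layer plays (the embedding dimension and the summing of per-bit rows are purely our assumptions, and a real implementation would learn the weights by back-propagation), a label lookup table might look like:

```python
import numpy as np

rng = np.random.default_rng(0)

class LabelEmbedding:
    """Map a 10-bit structure label to a dense vector: each (position,
    bit-value) pair indexes a weight row, and the rows are summed so the
    fitting network receives a trainable encoding of the label."""
    def __init__(self, label_len=10, dim=16):
        # one weight row per (position, bit-value) pair
        self.weights = rng.normal(size=(label_len, 2, dim))

    def __call__(self, label):
        return sum(self.weights[i, bit] for i, bit in enumerate(label))

emb = LabelEmbedding()
print(emb([0, 1] * 5).shape)  # (16,)
```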
In a specific operation process, as shown in fig. 3, the specific steps of the melody length determination network in step four are as follows:
first, a melody frequency-domain sequence of a musical piece is generated with a long short-term memory network;
then, a module that assists in memorizing the low-frequency part is designed as the stopping mark of the long short-term memory network and can serve as a reference mark for the other frequency bands; on this basis, the auxiliary low-frequency memory module can be separated out as an independent network module, and this network can estimate the probable length of the melody from the low-frequency part of the frequency-domain sequence;
then, a range is determined for the melody lengths of the music used to train the network, and the output of the neural network is normalized with this range;
finally, this length range is mapped uniformly onto the (-1, 1) output range of the tanh activation function.
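The range normalization onto the tanh output interval can be written as a pair of simple affine maps, using the 300-3000-bit range given later in the document (the function names are ours):

```python
def normalize_length(length, lo=300, hi=3000):
    """Map a melody length in [lo, hi] onto [-1, 1], matching the tanh
    output range of the length determination network."""
    return 2.0 * (length - lo) / (hi - lo) - 1.0

def denormalize_length(y, lo=300, hi=3000):
    """Invert the mapping: turn a tanh-range network output back into a
    melody length in bits."""
    return (y + 1.0) / 2.0 * (hi - lo) + lo

print(normalize_length(300), normalize_length(3000))  # -1.0 1.0
print(round(denormalize_length(0.0)))                 # 1650
```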
It should be noted that the data format used to train the long short-term memory network is a pitch contour curve with a time step of one sixteenth note and pitches encoded with C3 mapped to 60; the long short-term memory network uses RMSProp as its optimizer, and the length of the generated melody is 500.
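A minimal sketch of this data format (the note representation as (pitch, beats) pairs and the helper name are our assumptions; only the sixteenth-note time step and the C3 = 60 convention come from the text):

```python
def encode_contour(notes, step=0.25):
    """Expand (midi_pitch, duration_in_beats) pairs into one pitch value
    per sixteenth-note time step (step = 0.25 of a quarter-note beat)."""
    contour = []
    for pitch, beats in notes:
        contour.extend([pitch] * int(round(beats / step)))
    return contour

# one quarter note on C3 (60) and an eighth note on D3 (62)
print(len(encode_contour([(60, 1.0), (62, 0.5)])))  # 4 + 2 = 6 steps
```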
In a specific operation process, as shown in fig. 4, the specific steps of vocabulary matching in step five are as follows:
first, the keys of all melodies in the music library are determined and uniformly transposed to C major;
then, the melodies are cut into segments of the vocabulary length to obtain a corpus;
finally, the corpus is matched piecewise against the rough melody generated by the neural network, the matching criterion being minimization of the mean squared error.
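The piecewise matching can be sketched as a brute-force nearest-word search under mean squared error (the patent's classification-label index for fast retrieval is omitted here; the function and variable names are ours):

```python
import numpy as np

def match_vocabulary(rough_melody, corpus, word_len=8):
    """Slide over the rough curve in word_len chunks and replace each
    chunk with the corpus word of minimum mean squared error."""
    rough = np.asarray(rough_melody, dtype=float)
    words = np.asarray(corpus, dtype=float)   # shape (n_words, word_len)
    out = []
    for start in range(0, len(rough) - word_len + 1, word_len):
        segment = rough[start:start + word_len]
        mse = ((words - segment) ** 2).mean(axis=1)  # MSE to every word
        out.append(words[mse.argmin()])
    return np.concatenate(out)

corpus = [[60] * 8, [62] * 8, [64] * 8]
result = match_vocabulary([60.4] * 8 + [63.7] * 8, corpus)
print(result)  # first segment snaps to 60, second to 64
```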
It should be noted that the above operation steps of the present invention include the following parameter settings:
the label length is set to 10;
the noise input length is 100;
the length of the outputted frequency domain information is 600;
the frequency domain intensity scaling factor is set to 0.2;
the long-term structure fitting network uses an Adam optimizer for parameter optimization, with the learning rate set to 1×10⁻⁴.
It should be noted that the above operation steps of the present invention further include the following parameter settings:
the length determination network performs parameter optimization with an Adam optimizer whose learning rate is set to 1×10⁻⁴;
the vocabulary matching length is set to 8, and classification labels are used for fast retrieval;
in addition, as shown in the diagram of the distribution of melody lengths in the music library in fig. 7, the melody lengths exhibit a clear distribution pattern, so the melody length of the music is limited to between 300 and 3000 bits, corresponding to melody durations of 40 seconds to 7 minutes.
A total of 120 melodies were generated with the networks described herein for performance assessment in the comparative experiments below. Considering the degree of optimization between networks, music generated by a long short-term memory network with the three-layer structure shown in fig. 5 was selected for the comparison experiments. Considering the training time of the long short-term memory network, the original music library was shortened before being used to train its parameters. The trained long short-term memory network was likewise used to generate 120 melodies for the performance comparison. There are many ways to quantify the internal variation of a melody, but they are all, in essence, rules describing melodic change. The statistical method for the rhythm and pitch transition patterns shown in fig. 6 was designed following the idea of a Markov chain. Considering the distribution of actual melodies, the size of the rhythm transition matrix is set to 16, corresponding to note lengths from one sixteenth note to one whole note. Following the same idea, a pitch-change transition matrix can also be computed; its size is set to 12, corresponding to pitch changes from one semitone to one octave.
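The Markov-chain-style transition statistics can be sketched for the rhythm case as follows (the binning of durations into sixteenth-note multiples and the row normalization are our reading of the text; the pitch-change matrix is analogous with size 12):

```python
import numpy as np

def rhythm_transfer_matrix(durations, size=16):
    """Count transitions between successive note durations, measured in
    sixteenth notes (1 = sixteenth, 16 = whole note), then normalize each
    row so observed rows form probability distributions."""
    idx = np.clip(np.asarray(durations, dtype=int), 1, size) - 1
    matrix = np.zeros((size, size))
    for a, b in zip(idx[:-1], idx[1:]):
        matrix[a, b] += 1
    row_sums = matrix.sum(axis=1, keepdims=True)
    # rows with no observed transitions stay all-zero
    return np.divide(matrix, row_sums, out=np.zeros_like(matrix),
                     where=row_sums > 0)

# quarter, quarter, half, quarter, whole (in sixteenth-note units)
m = rhythm_transfer_matrix([4, 4, 8, 4, 16])
print(m[3, 3], m[3, 7])  # quarter->quarter and quarter->half frequencies
```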
Using the performance statistics described above, the following table shows the mean squared error against the true values of the rhythm and pitch-change transition matrices for the method of the present invention and for the long short-term memory network method.
Comparing the results shows that the music generated by the proposed method is closer to the distribution of real music than the music generated by the long short-term memory network.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.
Furthermore, it should be understood that although the present description is organized by embodiment, not every embodiment contains only one independent technical solution; the description is written this way merely for clarity, and the embodiments may be combined as appropriate by those skilled in the art to form other implementations.
Claims (8)
1. A music melody generation method based on a pitch contour curve, characterized by comprising the following steps:
step one, extracting the long-term structure information of the pitch contour in the frequency domain, where the long-term structure information comprises the low-frequency part of the pitch contour's frequency-domain sequence and reflects the long-term trend of the melody;
step two, fitting the long-term structure information with a label-controlled neural network, so that long-term structure information corresponding to a given label can be generated;
step three, training another neural network with the long-term structure information and the melody length information of the music data, so that the network can predict melody length information from long-term structure information;
step four, determining the length of the target melody with the trained neural network, and expanding the long-term structure in the frequency domain to obtain a rough melody curve;
step five, performing segment-by-segment vocabulary matching and replacement on the rough melody curve with a vocabulary collected from the music dataset, finally obtaining music with optimized details.
2. The method for generating a musical melody based on a pitch contour as defined in claim 1, wherein the specific steps of the long-term structure fitting network in step two are as follows:
first, a suitable length is determined to compress the long-term structure; after reasonable selection, the compressed long-term structures are unified to a length of 300 bits;
then, the mean pitch of every melody is shifted to C3, i.e. 60; after the DC component of the frequency-domain sequence is deleted, the sequence carries only the long-term characteristics of the melody, decoupling it from the specific melody;
next, the real and imaginary parts of the frequency-domain sequence are separated and recombined into a sequence of length 600;
finally, label information describing the rise and fall of the melody's long-term structure is constructed and fed into the fitting network together with the corresponding long-term structure.
3. The method for generating a musical melody based on a pitch contour as defined in claim 1, wherein in step two an embedding-layer network is used within the long-term structure fitting network to control the trend of the generated long-term structure.
4. The method for generating a musical melody based on a pitch contour as defined in claim 1, wherein the specific steps of the melody length determination network in step four are as follows:
first, a melody frequency-domain sequence of a musical piece is generated with a long short-term memory network;
then, a module that assists in memorizing the low-frequency part is designed as the stopping mark of the long short-term memory network and can serve as a reference mark for the other frequency bands; on this basis, the auxiliary low-frequency memory module can be separated out as an independent network module, and this network can estimate the probable length of the melody from the low-frequency part of the frequency-domain sequence;
then, a range is determined for the melody lengths of the music used to train the network, and the output of the neural network is normalized with this range;
finally, this length range is mapped uniformly onto the (-1, 1) output range of the tanh activation function.
5. The method for generating a musical melody based on a pitch contour as defined in claim 4, wherein the data format used to train the long short-term memory network is a pitch contour curve with a time step of one sixteenth note and pitches encoded with C3 mapped to 60; the long short-term memory network uses RMSProp as its optimizer, and the length of the generated melody is 500.
6. The method for generating a musical melody based on a pitch contour as defined in claim 1, wherein the specific steps of vocabulary matching in step five are as follows:
first, the keys of all melodies in the music library are determined and uniformly transposed to C major;
then, the melodies are cut into segments of the vocabulary length to obtain a corpus;
finally, the corpus is matched piecewise against the rough melody generated by the neural network, the matching criterion being minimization of the mean squared error.
7. The method for generating a musical melody based on a pitch contour as defined in claim 1, wherein the method comprises the following parameter settings:
the label length is set to 10;
the noise input length is 100;
the length of the output frequency-domain information is 600;
the frequency-domain intensity scaling factor is set to 0.2;
the long-term structure fitting network uses an Adam optimizer for parameter optimization, with the learning rate set to 1×10⁻⁴.
8. The method for generating a musical melody based on a pitch contour as defined in claim 1, wherein the method further comprises the following parameter settings:
the length determination network performs parameter optimization with an Adam optimizer whose learning rate is set to 1×10⁻⁴;
the vocabulary matching length is set to 8, and classification labels are used for fast retrieval;
the melody length of the music is set to between 300 and 3000 bits, corresponding to melody durations of 40 seconds to 7 minutes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010559217.1A CN111739492B (en) | 2020-06-18 | 2020-06-18 | Music melody generation method based on pitch contour curve |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010559217.1A CN111739492B (en) | 2020-06-18 | 2020-06-18 | Music melody generation method based on pitch contour curve |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111739492A CN111739492A (en) | 2020-10-02 |
CN111739492B true CN111739492B (en) | 2023-07-11 |
Family
ID=72649711
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010559217.1A Active CN111739492B (en) | 2020-06-18 | 2020-06-18 | Music melody generation method based on pitch contour curve |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111739492B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112489606B (en) * | 2020-11-26 | 2022-09-27 | 北京有竹居网络技术有限公司 | Melody generation method, device, readable medium and electronic equipment |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002287757A (en) * | 2001-03-23 | 2002-10-04 | Yamaha Corp | Sound data transfer method and sound data transfer apparatus, and program |
CN1737798A (en) * | 2005-09-08 | 2006-02-22 | 上海交通大学 | Music rhythm sectionalized automatic marking method based on eigen-note |
CN101000765A (en) * | 2007-01-09 | 2007-07-18 | 黑龙江大学 | Speech synthetic method based on rhythm character |
KR20170128073A (en) * | 2017-02-23 | 2017-11-22 | 반병현 | Music composition method based on deep reinforcement learning |
WO2018065029A1 (en) * | 2016-10-03 | 2018-04-12 | Telefonaktiebolaget Lm Ericsson (Publ) | User authentication by subvocalization of melody singing |
WO2018155800A1 (en) * | 2017-02-24 | 2018-08-30 | Samsung Electronics Co., Ltd. | Mobile device and method for executing music-related application |
CN110263728A (en) * | 2019-06-24 | 2019-09-20 | 南京邮电大学 | Anomaly detection method based on improved pseudo- three-dimensional residual error neural network |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP5293460B2 (en) * | 2009-07-02 | 2013-09-18 | ヤマハ株式会社 | Database generating apparatus for singing synthesis and pitch curve generating apparatus |
Non-Patent Citations (3)
Title |
---|
Melody Extraction From Polyphonic Music Signals Using Pitch Contour Characteristics; Justin Salamon et al.; IEEE Transactions on Audio, Speech, and Language Processing; pp. 1759-1760 * |
Research and Implementation of a Query-by-Humming Music Retrieval System; Li Yang; China Masters' Theses Full-text Database; pp. 19-40 * |
Multi-scale Object Detection Based on Outer-Contour Blurring; Cheng Yanyun; Journal of Nanjing University of Posts and Telecommunications; Vol. 38, No. 2; pp. 78-80 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112364894B (en) | Zero sample image classification method of countermeasure network based on meta-learning | |
CN104854654B (en) | For the method and system using the speech recognition of search inquiry information to process | |
CN104347067A (en) | Audio signal classification method and device | |
CN110310666B (en) | Musical instrument identification method and system based on SE convolutional network | |
CN110287325A (en) | A kind of power grid customer service recommended method and device based on intelligent sound analysis | |
CN108197294A (en) | A kind of text automatic generation method based on deep learning | |
CN109727590A (en) | Music generating method and device based on Recognition with Recurrent Neural Network | |
CN102822889B (en) | Pre-saved data compression for tts concatenation cost | |
CN109801645B (en) | Musical tone recognition method | |
TW201417092A (en) | Guided speaker adaptive speech synthesis system and method and computer program product | |
Chen et al. | Diagnose Parkinson’s disease and cleft lip and palate using deep convolutional neural networks evolved by IP-based chimp optimization algorithm | |
CN111414513B (en) | Music genre classification method, device and storage medium | |
CN111382260A (en) | Method, device and storage medium for correcting retrieved text | |
CN111739492B (en) | Music melody generation method based on pitch contour curve | |
CN114676687A (en) | Aspect level emotion classification method based on enhanced semantic syntactic information | |
Marxer et al. | Unsupervised incremental online learning and prediction of musical audio signals | |
CN113948066A (en) | Error correction method, system, storage medium and device for real-time translation text | |
CN110675879B (en) | Audio evaluation method, system, equipment and storage medium based on big data | |
CN115841119A (en) | Emotional cause extraction method based on graph structure | |
Chuan et al. | Generating and evaluating musical harmonizations that emulate style | |
CN111178051A (en) | Building information model self-adaptive Chinese word segmentation method and device | |
CN107886132A (en) | A kind of Time Series method and system for solving music volume forecasting | |
CN110032642B (en) | Modeling method of manifold topic model based on word embedding | |
CN110134823B (en) | MIDI music genre classification method based on normalized note display Markov model | |
JP4945465B2 (en) | Voice information processing apparatus and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||