CN115762449A - Conditional music theme melody automatic generation method and system based on Transformer - Google Patents

Conditional music theme melody automatic generation method and system based on Transformer

Info

Publication number
CN115762449A
Authority
CN
China
Prior art keywords
music
theme
sequence
sample
data set
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211350721.6A
Other languages
Chinese (zh)
Inventor
王恒
汪骁虎
郝森
油梦楠
尤昕源
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Polytechnic University
Original Assignee
Wuhan Polytechnic University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Polytechnic University filed Critical Wuhan Polytechnic University
Priority to CN202211350721.6A
Publication of CN115762449A
Legal status: Pending

Abstract

The invention discloses a Transformer-based method and system for automatically generating conditional music theme melodies. Data in an original data set are converted into a uniform data format and screened to obtain a theme music data set; music theme segments are extracted from the theme music in the theme music data set; the music theme segment sequence of each sample in the theme music data set is input into a Transformer encoder to obtain the encoder output; the complete music sequence of each sample is input into a Transformer decoder, whose input consists of two parts, namely the sample's complete music sequence and the encoder output; decoding yields a music sequence containing the music theme segments, and error calculation is performed to update the parameters. The model is trained repeatedly until the termination condition is reached; the music theme melody input by the user is then fed into the model as the conditional music theme segment sequence to obtain a complete music sequence containing that theme segment sequence, which is saved as a file.

Description

Conditional music theme melody automatic generation method and system based on Transformer
Technical Field
The invention relates to the field of intelligent music generation, and in particular to a Transformer-based conditional music theme melody automatic generation method and system.
Background
Music is a temporal art, and a graceful piece of music is composed of many parts. Among the many expressive factors of music, melody is the most important and can be called the soul of music. A pleasant melody is repetitive: some segments of the melody reappear over time, and these repeated segments often carry the emotion the author wants to express, so they can be called the theme of the music. However, traditional automatic music generation methods hardly consider the theme of music, so the generated melodies are messy and lack harmony, and fail to leave a deep impression. For example, the method adopted by Huang Songguo uses music files in MIDI format as the data set, extracts features from the MIDI files to obtain melody features, and processes the melody features with an algorithmic generator to obtain the final melody. This method only considers melody features such as note progression, note dynamics, rhythmic position and musical interval; these features exist only at the note level and do not capture how notes are arranged across an entire melody, so the regularity and harmony of the generated music are poor and the theme style of the music cannot be embodied. As another example, Zhao Zhou et al. propose a music accompaniment automatic generation method and system based on an encoding-decoding network. The adopted generation method still only considers note-level music features and does not take the overall structure and theme style of the music into account at the generation stage, relying instead on a deep learning model to extract these latent features; such a generation approach is often inefficient, and the quality of the generated music is difficult to compare with that of real music.
In the current patent literature, granted patent CN109727590B provides a method and device for generating music based on a recurrent neural network, in the technical field of deep learning. The method comprises the following steps: establishing a recurrent neural network; preprocessing a first note sequence in MIDI format to obtain a training data set; training the recurrent neural network on the training data set to obtain a neural network model; computing and ranking all note events of an input second note sequence through the neural network model and a sampling strategy, and outputting a third note sequence; decoding and converting the third note sequence into a note sequence file in MIDI format; and converting the note sequence file into an output file in an audio format. That invention generates music through a neural network model; by means of the strong learning and expressive capability of the deep neural network and a sampling strategy, a high-quality melody is obtained quickly and effectively, making it convenient for users to generate original melodies and effectively improving the efficiency of music creation.
However, that invention has the defect of using a recurrent neural network to generate music: recurrent neural networks suffer from vanishing gradients when processing long sequences and cannot effectively learn the corresponding long-range features. As a result, the generated music often lacks long-range structural coherence, which impairs its listening quality and practical value.
A corresponding technical solution is therefore urgently needed in this field.
Disclosure of Invention
To overcome the defects of the prior art, the invention provides a Transformer-based conditional music theme melody automatic generation scheme.
The invention provides a Transformer-based conditional music theme melody automatic generation method, comprising the following steps:
step 1, performing uniform data format conversion on the data in an original data set to obtain a music data set in MIDI format, filtering the MIDI-format data set to screen out data samples containing theme music and discard non-theme music, and obtaining a theme music data set;
step 2, extracting music theme segments from the theme music in the theme music data set;
step 3, inputting the music theme segment sequence of each sample in the theme music data set into a Transformer encoder and encoding it to obtain the encoder output; the music theme segment sequence is an initial music representation sequence obtained by processing the original MIDI-format music file, converted through the corresponding vocabulary into an integer array;
step 4, inputting the complete music sequence of each sample in the theme music data set into a Transformer decoder, wherein the sample's complete music sequence is an initial music representation sequence obtained by processing the original MIDI-format music file, converted through the corresponding vocabulary into an integer array;
the input of the decoder consists of two parts, namely the sample's complete music sequence and the encoder output; the decoder input sequence is decoded to obtain a music sequence containing the music theme segments, error calculation is then performed between the decoder output sequence and the original sample sequence, and the parameters of the encoder and decoder are updated synchronously using the backpropagation algorithm;
step 5, returning to repeat steps 3 and 4 to train the model until the training termination condition is reached; the music theme melody input by the user is then fed into the model as the conditional music theme segment sequence to obtain a complete music sequence containing that theme segment sequence, which is saved as a file in a specified music format.
Moreover, step 1 is implemented as follows: for the format-converted MIDI music data set, each piece of MIDI-format music is taken as a sample; the sample is encoded to obtain an initial music representation in REMI format, which is divided by bars, with two bars forming one segment, so that each sample is divided into several segments; the repetition count of each segment within the sample is then computed, and the segment with the largest repetition count is taken as the representative segment; if the repetition count of the representative segment reaches a preset threshold, the sample is determined to contain a music theme. All samples in the data set are processed in this way to obtain the theme music data set.
Moreover, step 2 is implemented as follows: each piece of MIDI-format theme music is taken as a sample; the sample is encoded to obtain an initial music representation in REMI format and divided by bars, with two bars forming one segment, so that each sample is divided into several segments; word embedding is performed on the divided sample to obtain the corresponding word vectors; a clustering algorithm is then applied to each sample's word vectors to group identical or similar segments into clusters; the cluster containing the most segments is selected as the theme cluster, and the segment in the theme cluster closest to the cluster centroid is selected as the music theme segment.
Furthermore, step 3 is implemented with a Transformer encoder composed of 6 identical encoding layers, each consisting of a multi-head self-attention mechanism and a feedforward layer, using residual addition and regularization operations.
Moreover, the decoder consists of 6 identical decoding layers, each consisting of a multi-head self-attention mechanism, a cross-attention mechanism and a feedforward layer, using residual addition and regularization operations.
In another aspect, the invention provides a Transformer-based conditional music theme melody automatic generation system, used to implement the Transformer-based conditional music theme melody automatic generation method described above.
Furthermore, the system includes a processor and a memory, the memory storing program instructions and the processor calling the instructions stored in the memory to execute the Transformer-based conditional music theme melody automatic generation method described above.
Alternatively, the system includes a readable storage medium on which a computer program is stored; when executed, the computer program implements the Transformer-based conditional music theme melody automatic generation method described above.
The advantages of the invention include:
1) A way of generating music with a music theme style is provided. With the theme of the music as a generation condition, the generated music is superior to that of other methods in both note-level feature fineness and overall melodic completeness, improving the quality of the generated music.
2) Given a sufficiently large data set, the generated theme-styled music is more diverse; it can meet users' personal requirements for music theme styles, bring creative inspiration to professional musicians, and provide customized music-theme-style services to ordinary users.
3) The REMI music event representation, the Transformer model and the MIDI-format music data sets mentioned in the specific embodiments of the invention are open-source resources that are easy to obtain, reducing the difficulty of implementing the invention.
Drawings
FIG. 1 is a flow chart of filtering an original music data set to obtain a theme music data set according to an embodiment of the present invention;
FIG. 2 is a flow chart of extracting music theme segments from the theme music data set according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of encoding an original music file to obtain a music sequence according to an embodiment of the present invention;
FIG. 4 is a model structure diagram of the Transformer-based theme music generation method according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram of the Transformer-based theme music generation system according to an embodiment of the present invention.
Detailed Description
The invention will be further elucidated and described below with reference to the drawings and specific embodiments.
The embodiment of the invention provides a Transformer-based conditional music theme melody automatic generation method, comprising the following steps:
step 1, performing uniform data format conversion on the data in an original data set to obtain a music data set in MIDI format, filtering the MIDI-format data set to screen out data samples containing theme music and discard non-theme music, and obtaining a theme music data set;
Further, the invention proposes the following preferred scheme: for the format-converted MIDI music data set, each piece of MIDI-format music is taken as a sample; the sample is encoded to obtain an initial music representation in REMI format and divided by bars, with two bars forming one segment, so that each sample is divided into several segments; the repetition count of each segment within the sample is then computed, and the segment with the largest repetition count is taken as the representative segment; if the repetition count of the representative segment reaches a preset threshold (for example, 4 times) or more, the sample is determined to contain a music theme. All samples in the data set are processed in this way to obtain the theme music data set.
As shown in FIG. 1, in the embodiment the original music data set is preprocessed to obtain a theme music data set. The preprocessing specifically includes the following steps:
1) Acquire the original music data set and uniformly convert its music files into MIDI format, obtaining a MIDI-format music data set.
2) Filter the music in the MIDI-format music data set to obtain a theme music data set. The specific filtering process, sketched after this list, converts each music file into a REMI event sequence, divides the sequence in units of two bars into equal-length REMI event sequence segments, and computes the repetition counts of the segments; the segment with the largest repetition count is taken as the representative segment, and the music file is determined to be theme music if this count is greater than or equal to 4. All music in the data set is filtered in this way to obtain the theme music data set.
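The filtering step can be sketched in Python as follows, assuming each sample has already been parsed into a list of bars, each bar being a list of REMI event tokens; the helper names and the midi_dataset variable are illustrative, not from the patent:

```python
from collections import Counter

THEME_THRESHOLD = 4  # repetition count used in this embodiment

def split_into_segments(bars, bars_per_segment=2):
    """Group consecutive bars into equal-length two-bar segments."""
    return [tuple(sum(bars[i:i + bars_per_segment], []))
            for i in range(0, len(bars) - bars_per_segment + 1, bars_per_segment)]

def has_theme(bars):
    """True if the most frequent two-bar segment repeats >= the threshold."""
    segments = split_into_segments(bars)
    if not segments:
        return False
    _, count = Counter(segments).most_common(1)[0]
    return count >= THEME_THRESHOLD

# theme_dataset = [bars for bars in midi_dataset if has_theme(bars)]
```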
Step 2, extract music theme segments from the theme music in the theme music data set.
Further, the invention provides the following preferred scheme: each piece of MIDI-format theme music is taken as a sample; the sample is encoded to obtain an initial music representation in REMI format and divided by bars, with two bars forming one segment, so that each sample is divided into several segments; word embedding is performed on the divided sample to obtain the corresponding word vectors; a clustering algorithm is then applied to each sample's word vectors to group identical or similar segments into clusters; the cluster containing the most segments is selected as the theme cluster, and the segment in the theme cluster closest to the cluster centroid is selected as the music theme segment.
As shown in FIG. 2, in the embodiment the music theme segments of all music in the theme music data set need to be acquired. The acquisition flow specifically includes the following steps:
1) Encode each theme music sample in the theme music data set to obtain a REMI event sequence; divide the REMI event sequence in units of two bars to obtain several equal-length REMI event sequence segments; look up the corresponding event numbers in the REMI vocabulary to obtain integer arrays of REMI event numbers; then apply word embedding to these integer arrays to obtain several equal-length word vector segments.
2) Apply a clustering algorithm to the acquired theme music word vector segments to group identical or similar segments into the same cluster, obtaining several different clusters; count the number of word vector segments in each cluster, take the cluster with the most word vector segments as the theme cluster, and take the segment in the theme cluster closest to the cluster centroid as the theme segment of the music. A sketch of this clustering step follows.
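A hedged sketch of the extraction using scikit-learn's KMeans; the patent specifies only "a clustering algorithm", so the choice of KMeans, the number of clusters, and representing each two-bar segment by a single pooled word vector are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_theme_index(segment_vectors, n_clusters=8):
    """segment_vectors: (num_segments, d_model) array, one pooled word vector
    per two-bar segment. Returns the index of the music theme segment."""
    km = KMeans(n_clusters=min(n_clusters, len(segment_vectors)), n_init=10)
    labels = km.fit_predict(segment_vectors)
    theme_label = np.bincount(labels).argmax()          # cluster with the most segments
    members = np.where(labels == theme_label)[0]
    centroid = km.cluster_centers_[theme_label]
    dists = np.linalg.norm(segment_vectors[members] - centroid, axis=1)
    return members[dists.argmin()]                      # member closest to the centroid
```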
Step 3, input the music theme segment sequence of each sample in the theme music data set into a Transformer encoder; the music theme segment sequence is a REMI-format initial music representation sequence obtained by processing the original MIDI-format music file with a Python toolkit, converted through the corresponding vocabulary into an integer array.
Further, the invention provides the following preferred scheme: the encoder consists of 6 identical encoding layers, each consisting of a multi-head self-attention mechanism and a feedforward layer, using residual addition and regularization operations.
As shown in FIG. 3, a schematic diagram of encoding an original music file to obtain a music sequence, the music files of the music data set are encoded into music sequences in the corresponding format. The specific encoding process includes the following steps:
1) The original music piece 301 is acquired from the music file.
2) The original music piece is parsed using a Python toolkit to obtain the REMI event sequence 302.
3) For each event in the sequence of REMI events 302, a REMI event vocabulary 305 is queried, resulting in an integer array 303 of corresponding REMI event numbers.
Specifically, the REMI event vocabulary 305 is composed as follows: Note-On_1, ..., Note-On_127 represent 127 different pitch events; Note-Duration_1, ..., Note-Duration_64 represent 64 different note duration events; Note-Velocity_1, ..., Note-Velocity_126 represent 126 different note velocity events; Tempo_17, ..., Tempo_197 represent 60 different tempo events covering tempi from 17 to 197 divided at intervals of 3; Position_0, ..., Position_15 represent 16 position events dividing a music bar into 16 positions; and Bar represents the start of a music bar, represented by a Bar event.
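This vocabulary can be reconstructed as a token-to-integer mapping; a Python sketch follows, where the ordering of the entries is an assumption and the tempo range is taken from the description above:

```python
def build_remi_vocab():
    """Build the REMI event vocabulary 305 as a token -> integer-number map."""
    tokens = ["Bar"]                                              # start of a music bar
    tokens += [f"Position_{i}" for i in range(16)]                # 16 positions per bar
    tokens += [f"Note-On_{i}" for i in range(1, 128)]             # 127 pitch events
    tokens += [f"Note-Duration_{i}" for i in range(1, 65)]        # 64 duration events
    tokens += [f"Note-Velocity_{i}" for i in range(1, 127)]       # 126 velocity events
    tokens += [f"Tempo_{t}" for t in range(17, 197, 3)]           # 60 tempo events
    return {token: number for number, token in enumerate(tokens)}

vocab = build_remi_vocab()            # 394 entries in total
# Converting a REMI event sequence 302 into the integer array 303:
event_ids = [vocab[event] for event in ["Bar", "Position_0", "Note-On_60"]]
```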
4) Word embedding is applied to the integer array 303 of REMI event numbers, yielding word vectors of dimension d_model, where the specific value of d_model can be set manually and is usually 512.
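A minimal PyTorch sketch of this word embedding step; the patent names no deep learning framework, so PyTorch and the toy input are assumptions (394 is the vocabulary size from the sketch above):

```python
import torch
import torch.nn as nn

d_model = 512                                   # embedding dimension, set manually
token_embedding = nn.Embedding(394, d_model)    # 394 = size of the REMI vocabulary

ids = torch.tensor([[0, 1, 17]])                # toy integer array 303
word_vectors = token_embedding(ids)             # shape (1, 3, 512): word vectors 304
```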
Further, the REMI event sequences depicted in FIG. 1 and FIG. 2 correspond to 302; the REMI vocabulary, the integer array of REMI event numbers and the word vector segments depicted in FIG. 2 correspond to 305, 303 and 304, respectively.
Step 4, input the complete music sequence of each sample in the theme music data set into a Transformer decoder, where the sample's complete music sequence is an initial music representation sequence obtained by processing the original MIDI-format music file, converted through the corresponding vocabulary into an integer array; the input of the decoder consists of two parts, namely the sample's complete music sequence and the encoder output. The decoder input sequence is decoded to obtain a music sequence containing the music theme segments; finally, error calculation is performed between the decoder output sequence and the original sample sequence, and the parameters of the encoder and decoder are updated synchronously using the backpropagation algorithm.
Further, the invention provides the following preferred scheme: the decoder consists of 6 identical decoding layers, each consisting of a multi-head self-attention mechanism, a cross-attention mechanism and a feedforward layer, using residual addition and regularization operations.
As shown in FIG. 4, the model structure of the Transformer-based theme music generation method mainly consists of two parts: the left side depicts the encoder part and the right side the decoder part.
Further, the original input of the encoder is a music theme segment sequence, and the final encoder input x_e is obtained as follows. The music theme segment sequence of each music sample in the theme music data set serves as the original input of the encoder; this sequence is an integer array obtained by converting the REMI-format initial music representation sequence through the corresponding REMI vocabulary, denoted s_e. The final input of the encoder is x_e = TE(s_e) + PS(d_model), where TE() denotes the token embedding function (Token Embedding), PS() denotes the sinusoidal positional encoding function (Sinusoidal Positional Encoding), s_e denotes the integer array corresponding to the music theme segment sequence, and d_model denotes the embedding dimension of the token embedding function.
The encoder processes data as follows: the final encoder input x_e is first obtained; self-attention is computed on x_e, followed by residual addition and regularization; the result is sent to a feedforward layer, and the feedforward output, after residual addition and regularization, gives the final encoder output y_e.
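A PyTorch sketch of the encoder side, computing x_e = TE(s_e) + PS(d_model) and passing it through 6 encoding layers; token_embedding comes from the embedding sketch above, while the number of attention heads and the toy sequence length are assumptions:

```python
import math
import torch
import torch.nn as nn

def sinusoidal_positions(seq_len, d_model):
    """PS(): standard sinusoidal positional encoding, shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float)
                    * (-math.log(10000.0) / d_model))
    ps = torch.zeros(seq_len, d_model)
    ps[:, 0::2] = torch.sin(pos * div)
    ps[:, 1::2] = torch.cos(pos * div)
    return ps

d_model, n_layers, n_heads = 512, 6, 8     # n_heads is an assumption
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), n_layers)

s_e = torch.randint(0, 394, (1, 32))                              # toy theme segment sequence
x_e = token_embedding(s_e) + sinusoidal_positions(32, d_model)    # TE(s_e) + PS(d_model)
y_e = encoder(x_e)                                                # encoder output y_e
```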
Further, the input of the decoder consists of two parts, namely the bottom input x_d of the decoder and the encoder output y_e. x_d is obtained as follows: the complete music sequence of each music sample in the theme music data set is input into the Transformer decoder; the complete music sequence is an integer array obtained by converting the REMI-format initial music representation sequence through the corresponding REMI vocabulary, denoted s_d. The bottom input of the decoder is x_d = TE(s_d) + PS(d_model), where TE() denotes the token embedding function (Token Embedding), PS() denotes the sinusoidal positional encoding function (Sinusoidal Positional Encoding), s_d denotes the integer array of the initial music representation corresponding to the complete music sequence, and d_model denotes the embedding dimension of the token embedding function. The encoder output y_e serves as an input to the decoder when computing cross-attention.
The decoder processes data as follows: the final bottom input x_d of the decoder is first obtained; masked multi-head attention is computed on x_d; after residual addition and regularization, cross-attention is computed between the result and the encoder output y_e; the output of the cross-attention module, after residual addition and regularization, is sent to a feedforward layer; the feedforward output, after residual addition and regularization, is passed to a linear layer, and the final result, namely the complete music sequence containing the music theme segment sequence, is obtained through a Softmax function.
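A matching sketch of the decoder side, continuing the encoder sketch (it reuses token_embedding, sinusoidal_positions, d_model, n_heads, n_layers and y_e); the toy sequence length is an assumption:

```python
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), n_layers)
output_head = nn.Linear(d_model, 394)      # linear layer projecting to REMI vocabulary logits

s_d = torch.randint(0, 394, (1, 128))      # toy complete music sequence
x_d = token_embedding(s_d) + sinusoidal_positions(128, d_model)
causal_mask = nn.Transformer.generate_square_subsequent_mask(128)  # masked self-attention
h = decoder(x_d, y_e, tgt_mask=causal_mask)        # cross-attention uses y_e as memory
probs = torch.softmax(output_head(h), dim=-1)      # distribution over REMI events
```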
Step 5, return to repeat steps 3 and 4 to train the model until the training termination condition is reached; the music theme melody input by the user is then fed into the model as the conditional music theme segment sequence to obtain a complete music sequence containing that theme segment sequence, which is saved as a file in a specified music format.
At each iteration the model parameters are updated, and the stopping condition of training can be set as the model loss falling below a certain value; in practice, a person skilled in the art can set the training stopping conditions as required. A minimal sketch of such a training loop follows.
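This sketch continues the encoder and decoder sketches above; the optimizer, learning rate, loss threshold and the one-token shift for next-event prediction are assumptions, since the patent specifies only error calculation against the original sample sequence and backpropagation:

```python
criterion = nn.CrossEntropyLoss()
params = (list(token_embedding.parameters()) + list(encoder.parameters())
          + list(decoder.parameters()) + list(output_head.parameters()))
optimizer = torch.optim.Adam(params, lr=1e-4)
loss_threshold = 0.5                                # illustrative stopping value

loss = torch.tensor(float("inf"))
while loss.item() > loss_threshold:
    y_e = encoder(token_embedding(s_e) + sinusoidal_positions(32, d_model))
    x_d = token_embedding(s_d[:, :-1]) + sinusoidal_positions(127, d_model)
    mask = nn.Transformer.generate_square_subsequent_mask(127)
    logits = output_head(decoder(x_d, y_e, tgt_mask=mask))
    # Error between the decoder output and the original sample sequence,
    # shifted by one event; backpropagation then updates the encoder and
    # decoder parameters synchronously.
    loss = criterion(logits.reshape(-1, 394), s_d[:, 1:].reshape(-1))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```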
The above process focuses on the generation of theme music: the automatically generated music has better regularity and repetitiveness, conforms to composers' writing habits, and through this intelligent scheme can bring listeners better auditory enjoyment.
In specific implementation, a person skilled in the art can implement the above automatic process using computer software technology. System devices implementing the method, such as a computer-readable storage medium storing the corresponding computer program of the technical solution of the invention, and computer equipment including and running that computer program, should also fall within the protection scope of the invention.
As shown in FIG. 5, in some possible embodiments, a Transformer-based conditional music theme melody automatic generation system is provided, comprising the following modules:
1) A conversion module: converts the format of the original data set. The music files of the original data set may be in WAV, MP3, WMA or other audio formats; the conversion module uniformly converts them into the MIDI (Musical Instrument Digital Interface) format so that the model and the computer can process them uniformly.
2) A filtering module: filters the original MIDI-format data set obtained from the conversion module, screening out theme music data and discarding non-theme music data to obtain a theme music data set.
3) An extraction module: for each music sample of the theme music data set obtained from the filtering module, extracts the music theme segment corresponding to that theme music sample.
4) A construction module: builds the Transformer-based music theme melody automatic generation model, including the construction of the encoder and decoder and the encoding modules corresponding to the input sequences.
5) A training module: inputs the music theme segment sequences and the complete music sequences into the model built by the construction module for training, obtaining the final trained model.
6) A generation module: the user inputs a piece of music theme melody, which is sent to the extraction module to extract the music theme segment and obtain a music theme segment sequence; this sequence is input into the trained model to generate a complete music sequence containing the music theme segment sequence (see the decoding sketch after this list).
7) An output module: performs music format conversion on the music sequence output by the generation module to obtain the final playable music file.
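A hedged sketch of the generation module's decoding, reusing the names from the model sketches in the method description above; greedy argmax sampling, the maximum length and the begin token are assumptions, as the patent does not specify a sampling strategy:

```python
@torch.no_grad()
def generate(theme_ids, max_len=512, bos_id=0):
    """theme_ids: (1, T) integer array of the conditional music theme segment
    sequence. Returns a complete music sequence containing that sequence."""
    y_e = encoder(token_embedding(theme_ids)
                  + sinusoidal_positions(theme_ids.size(1), d_model))
    out = torch.tensor([[bos_id]])                       # seed the decoder
    for _ in range(max_len):
        x_d = token_embedding(out) + sinusoidal_positions(out.size(1), d_model)
        mask = nn.Transformer.generate_square_subsequent_mask(out.size(1))
        logits = output_head(decoder(x_d, y_e, tgt_mask=mask))
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        out = torch.cat([out, next_id], dim=1)           # append the next REMI event
    return out
```

The generated integer array would then be converted back through the REMI vocabulary and written out as a playable MIDI file by the output module.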
In some possible embodiments, a Transformer-based conditional music theme melody automatic generation system is provided, comprising a readable storage medium on which a computer program is stored; when executed, the computer program implements the Transformer-based conditional music theme melody automatic generation method described above.
The specific embodiments described herein are merely illustrative of the spirit of the invention. Those skilled in the art may make various modifications, additions or substitutions to the described embodiments without departing from the spirit of the invention or the scope defined by the appended claims.

Claims (8)

1. A Transformer-based conditional music theme melody automatic generation method, characterized by comprising the following steps:
step 1, performing uniform data format conversion on the data in an original data set to obtain a music data set in MIDI format, filtering the MIDI-format data set to screen out data samples containing theme music and discard non-theme music, and obtaining a theme music data set;
step 2, extracting music theme segments from the theme music in the theme music data set;
step 3, inputting the music theme segment sequence of each sample in the theme music data set into a Transformer encoder and encoding it to obtain the encoder output; the music theme segment sequence is an initial music representation sequence obtained by processing the original MIDI-format music file, converted through the corresponding vocabulary into an integer array;
step 4, inputting the complete music sequence of each sample in the theme music data set into a Transformer decoder, wherein the sample's complete music sequence is an initial music representation sequence obtained by processing the original MIDI-format music file, converted through the corresponding vocabulary into an integer array;
the input of the decoder consists of two parts, namely the sample's complete music sequence and the encoder output; the decoder input sequence is decoded to obtain a music sequence containing the music theme segments, error calculation is then performed between the decoder output sequence and the original sample sequence, and the parameters of the encoder and decoder are updated synchronously using the backpropagation algorithm;
step 5, returning to repeat steps 3 and 4 to train the model until the training termination condition is reached; the music theme melody input by the user is then fed into the model as the conditional music theme segment sequence to obtain a complete music sequence containing that theme segment sequence, which is saved as a file in a specified music format.
2. The Transformer-based conditional music theme melody automatic generation method of claim 1, characterized in that: step 1 is implemented as follows: for the format-converted MIDI music data set, each piece of MIDI-format music is taken as a sample; the sample is encoded to obtain an initial music representation in REMI format, which is divided by bars, with two bars forming one segment, so that each sample is divided into several segments; the repetition count of each segment within the sample is then computed, and the segment with the largest repetition count is taken as the representative segment; if the repetition count of the representative segment reaches a preset threshold, the sample is determined to contain a music theme; all samples in the data set are processed in this way to obtain the theme music data set.
3. The Transformer-based conditional music theme melody automatic generation method of claim 1, characterized in that: step 2 is implemented as follows: each piece of MIDI-format theme music is taken as a sample; the sample is encoded to obtain an initial music representation in REMI format and divided by bars, with two bars forming one segment, so that each sample is divided into several segments; word embedding is performed on the divided sample to obtain the corresponding word vectors; a clustering algorithm is then applied to each sample's word vectors to group identical or similar segments into clusters; the cluster containing the most segments is selected as the theme cluster, and the segment in the theme cluster closest to the cluster centroid is selected as the music theme segment.
4. The Transformer-based conditional music theme melody automatic generation method of claim 1, characterized in that: step 3 is implemented with a Transformer encoder composed of 6 identical encoding layers, each consisting of a multi-head self-attention mechanism and a feedforward layer, using residual addition and regularization operations.
5. The Transformer-based conditional music theme melody automatic generation method of claim 4, characterized in that: the decoder consists of 6 identical decoding layers, each consisting of a multi-head self-attention mechanism, a cross-attention mechanism and a feedforward layer, using residual addition and regularization operations.
6. A Transformer-based conditional music theme melody automatic generation system, characterized in that: it is used to implement the Transformer-based conditional music theme melody automatic generation method according to any one of claims 1 to 5.
7. The Transformer-based conditional music theme melody automatic generation system of claim 6, characterized in that: it comprises a processor and a memory, the memory storing program instructions and the processor calling the instructions stored in the memory to execute the Transformer-based conditional music theme melody automatic generation method according to any one of claims 1 to 5.
8. The Transformer-based conditional music theme melody automatic generation system of claim 6, characterized in that: it comprises a readable storage medium on which a computer program is stored; when executed, the computer program implements the Transformer-based conditional music theme melody automatic generation method according to any one of claims 1 to 5.
CN202211350721.6A 2022-10-31 2022-10-31 Conditional music theme melody automatic generation method and system based on Transformer Pending CN115762449A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211350721.6A CN115762449A (en) 2022-10-31 2022-10-31 Conditional music theme melody automatic generation method and system based on Transformer

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211350721.6A CN115762449A (en) 2022-10-31 2022-10-31 Conditional music theme melody automatic generation method and system based on Transformer

Publications (1)

Publication Number Publication Date
CN115762449A (en) 2023-03-07

Family

ID=85354697

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211350721.6A Pending CN115762449A (en) 2022-10-31 2022-10-31 Conditional music theme melody automatic generation method and system based on Transformer

Country Status (1)

Country Link
CN (1) CN115762449A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116030777A (en) * 2023-03-13 2023-04-28 南京邮电大学 Specific emotion music generation method and system
CN116030777B (en) * 2023-03-13 2023-08-18 南京邮电大学 Specific emotion music generation method and system

Similar Documents

Publication Publication Date Title
Choi et al. Encoding musical style with transformer autoencoders
Yang et al. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation
CN109727590B (en) Music generation method and device based on recurrent neural network
CN111583891B (en) Automatic musical note vector composing system and method based on context information
von Rütte et al. Figaro: Generating symbolic music with fine-grained artistic control
Nakamura et al. Statistical piano reduction controlling performance difficulty
Choi et al. NANSY++: Unified voice synthesis with neural analysis and synthesis
Liu et al. AudioLDM 2: Learning holistic audio generation with self-supervised pretraining
CN115762449A (en) Conditional music theme melody automatic generation method and system based on Transformer
Thickstun et al. Coupled recurrent models for polyphonic music composition
CN112669811B (en) Song processing method and device, electronic equipment and readable storage medium
Hu et al. The beauty of repetition in machine composition scenarios
Zhang et al. WuYun: exploring hierarchical skeleton-guided melody generation using knowledge-enhanced deep learning
CN111754962A (en) Folk song intelligent auxiliary composition system and method based on up-down sampling
Pangestu et al. Generating music with emotion using transformer
CN114842819B (en) Single-track MIDI music generation method based on deep reinforcement learning
CN108922505B (en) Information processing method and device
Mo et al. A user-customized automatic music composition system
CN116386575A (en) Music generation method, device, electronic equipment and storage medium
Trochidis et al. CAMeL: Carnatic percussion music generation using n-gram models
Chandna et al. Loopnet: Musical loop synthesis conditioned on intuitive musical parameters
Wang et al. Motif transformer: Generating music with motifs
CN115004294A (en) Composition creation method, composition creation device, and creation program
CN112528631B (en) Intelligent accompaniment system based on deep learning algorithm
Macaya et al. Latent Chords: Generative Piano Chord Synthesis with Variational Autoencoders.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination