CN111653256B - Music accompaniment automatic generation method and system based on coding-decoding network - Google Patents

Music accompaniment automatic generation method and system based on coding-decoding network

Info

Publication number
CN111653256B
CN111653256B CN202010795908.1A CN202010795908A
Authority
CN
China
Prior art keywords
music
coding
accompaniment
track
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010795908.1A
Other languages
Chinese (zh)
Other versions
CN111653256A (en)
Inventor
赵洲
何金铮
任意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010795908.1A priority Critical patent/CN111653256B/en
Publication of CN111653256A publication Critical patent/CN111653256A/en
Application granted granted Critical
Publication of CN111653256B publication Critical patent/CN111653256B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G10H1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0033 Recording/reproducing or transmission of music for electrophonic musical instruments
    • G10H1/0041 Recording/reproducing or transmission of music for electrophonic musical instruments in coded form
    • G10H1/0058 Transmission between separate instruments or between individual components of a musical system
    • G10H1/0066 Transmission between separate instruments or between individual components of a musical system using a MIDI interface
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101 Music Composition or musical creation; Tools or processes therefor
    • G10H2210/111 Automatic composing, i.e. using predefined musical rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/011 Files or data streams containing coded musical information, e.g. for transmission
    • G10H2240/046 File format, i.e. specific or non-standard musical file format used in or adapted for electrophonic musical instruments, e.g. in wavetables

Abstract

The invention discloses a music accompaniment automatic generation method and system based on a coding-decoding network, belonging to the field of music design. The method mainly comprises the following steps: 1) for given original music data, training a coding neural network formed by multiple encoders to encode and learn the characteristics of multi-track (or single-track) music; 2) after encoding is finished, training a decoding neural network formed by multiple decoders on the multi-track (or single-track) music features output by the encoders and on the music to be finally generated. Compared with traditional methods based on sequence models, the invention uses a long-distance cache mechanism to better model long pieces of music, and uses the coding-decoding network to achieve a faster training process. The effect achieved by the present invention on the problem of multi-track (or single-track) musical accompaniment is better than that of conventional methods.

Description

Music accompaniment automatic generation method and system based on coding-decoding network
Technical Field
The invention relates to the field of music design, in particular to a music accompaniment automatic generation method and a music accompaniment automatic generation system based on a coding-decoding network.
Background
Music is an artistic form and cultural activity whose medium is sound organized in time. The general definition of music includes common elements such as pitch (which governs melody and harmony), rhythm (and its associated concepts tempo, meter, and articulation), dynamics (loudness and softness), and the sonic qualities of timbre and texture (sometimes referred to as the "color" of a musical sound). Different styles or types of music may emphasize, de-emphasize, or ignore certain of these elements. Music is performed with a variety of instruments and techniques. Accompaniment is a challenging problem in many areas of music technology, and automatic accompaniment techniques aim to generate accompaniment tracks for a given main track. With the rapid development of artificial intelligence and deep learning, many automatic accompaniment methods have been proposed.
An automatic accompaniment technique based on generative adversarial networks has been proposed in the prior art. The method treats MIDI as a piano-roll, with time on the horizontal axis and pitch on the vertical axis, and uses 0 and 1 to indicate whether a note is triggered at the current position and pitch. It introduces the generative adversarial network techniques widely used in the image field into the music generation process and addresses the musical accompaniment problem by designing a multi-track model and a temporal model: the multi-track model is responsible for the interdependencies among tracks, while the temporal model handles the temporal dependencies. However, piano-roll data are sparse, training is unstable, and experimental results also show that the quality of the musical accompaniment generated by this method is limited.
Meanwhile, in the field of automatic composition, several sequence-model-based methods have also been proposed; however, they adopt only a single decoder to decode a single track, so they can only realize continuation of single-track music and cannot realize automatic accompaniment.
In summary, current mainstream automatic accompaniment and automatic composition techniques cannot meet the requirement of generating high-quality multi-track automatic accompaniment, which remains a major challenge in the field of automatic accompaniment.
Disclosure of Invention
The present invention aims to solve the problems of the prior art. To overcome the inability of existing methods to generate high-quality automatic accompaniment, the present invention provides a music accompaniment automatic generation method and system based on a coding-decoding network.
The invention discloses a music accompaniment automatic generation method based on a coding-decoding network, which comprises the following steps.
1) A training set of source music is obtained; each piece of music serves as a training sample, and the tracks of each sample are labeled as a main track and accompaniment tracks.
2) A coding-decoding network structure is established, comprising a coding neural network and a decoding neural network, wherein the coding neural network is formed by a plurality of encoders and the decoding neural network is formed by a plurality of decoders.
3) The source music is read and encoded into different initial music representations according to its music format, yielding a source music coding representation; the main track sentence (x_1, …, x_M) and the accompaniment track sentence (y_1, …, y_N) are then obtained from the source music coding representation, where x_i denotes the i-th symbol in the main track sentence, M denotes the number of symbols in the main track sentence, y_i denotes the i-th note in the accompaniment track sentence of the source music, and N denotes the number of notes in the accompaniment track; finally, the main track (x_1, …, x_M) and the accompaniment track (y_1, …, y_N) are converted into embedded vectors through word embedding.
4) The embedded vector of the main track is input into the coding neural network for N steps of encoding; during encoding, the hidden-layer sequences computed in the encoder for earlier symbols of the main track sentence are stored in a cache, and the cached data are introduced into the encoding of later symbols; after the encoding is repeated N times, the encoder output is obtained as the encoded music representation of the main track.
The encoded music representation of the main track and the embedded vector of the accompaniment track are taken as input of the decoding neural network for N steps of decoding; during decoding, mask processing is applied to the cross-attention module in the decoder, the hidden-layer sequences computed in the decoder for earlier symbols of the accompaniment track sentence are stored in a cache, and the cached data are introduced into the decoding of later symbols; after the decoding is repeated N times, the output results of all decoders are compared with the accompaniment track, and the coding neural network and decoding neural network are trained until a trained coding-decoding network model is obtained.
5) For the unaccompanied music to be processed, the embedded vector of its main track is obtained and fed as input to the trained coding-decoding network model, which outputs the musical accompaniment; the obtained musical accompaniment and the unaccompanied music are then synthesized into music containing the complete accompaniment.
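As an illustration of step 5), a minimal sketch of the accompaniment generation loop at inference time is shown below; the encode/decode_step interface, greedy decoding, and the BOS/EOS symbols are assumptions for illustration rather than details fixed by the patent.

import torch

@torch.no_grad()
def generate_accompaniment(model, main_track_ids, bos_id, eos_id, max_len=1024):
    # Autoregressively decode accompaniment symbols for an unaccompanied main track.
    # `model` is assumed to expose `encode(src)` and `decode_step(memory, prefix)`
    # methods returning encoder states and next-symbol logits; these names are
    # illustrative and not part of the patent.
    memory = model.encode(main_track_ids)              # encoded main-track representation
    generated = [bos_id]
    for _ in range(max_len):
        logits = model.decode_step(memory, torch.tensor([generated]))
        next_id = int(logits[0, -1].argmax())          # greedy choice; sampling is also possible
        if next_id == eos_id:
            break
        generated.append(next_id)
    return generated[1:]                               # accompaniment symbol sequence

The returned symbol sequence is then converted back into notes and merged with the unaccompanied music by the synthesis step.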
Another object of the present invention is to provide an automatic musical accompaniment generation system for implementing the above method.
The system comprises:
A music sample collection module: used for acquiring music data as the source music training set or as the unaccompanied music to be processed, wherein the source music training set comprises condition tracks and target tracks, the condition tracks are labeled as main tracks, and the target tracks are labeled as accompaniment tracks.
Encoding-decoding network module: the system is provided with an encoding neural network and a decoding neural network, wherein the encoding neural network is composed of a plurality of encoders, and the decoding neural network is composed of a plurality of decoders.
The music format preprocessing module: different music preprocessing modes are configured, and the corresponding preprocessing modes are selected by recognizing different music formats.
A cache module: for buffering the hidden layer output information of each encoder and decoder.
A training module: for updating the parameters of the coding-decoding network module during the training phase.
The music synthesis module: the music accompaniment and the unaccompanied music are synthesized into music containing complete accompaniment and output.
Compared with the prior art, the invention has the following beneficial effects.
(1) The invention creatively applies the coding-decoding structure to the field of automatic music accompaniment. Compared with traditional piano-roll-based methods, the sequence model adopted here represents music data more effectively; training of piano-roll-based methods is very unstable, whereas training of the invention is stable and generalizes better.
(2) Sequence models applied to this field in the prior art adopt only a decoder, can decode only one track, and can therefore only realize continuation of single-track music. The invention adopts a coding-decoding structure and uses the encoder to encode the main track, so that the context information of the main track of the music can be modeled better, ensuring the performance of automatic accompaniment.
(3) The encoder and decoder adopted by the invention are implemented on the basis of the conventional Transformer model; however, a standard Transformer processes only a fixed-length segment at a time, which is unsuitable for encoding and decoding music tracks. The invention stores the information of previously processed segments, combines it with the information of the current segment, and processes them together. In this way, the context information of the track is fused, the model can handle longer segments, the method is suitable for longer music data, and the fused context information improves processing accuracy.
(4) The invention employs a bar-level attention mechanism: since notes within a bar tend to be highly correlated, masking is applied to the cross-attention module during training to ensure that each symbol in the decoder sees only the conditional context of the same bar, preventing irrelevant information from affecting the decoding process.
Drawings
Fig. 1 is a schematic diagram of an encoding-decoding network employed by the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in Fig. 1, the music accompaniment automatic generation method based on an encoding-decoding network used by the present invention includes the following steps.
1) For the input multi-track (or single-track) music, train a coding neural network composed of multiple encoders.
2) For the input multi-track (or single-track) music, obtain the output of the coding neural network; combine this output with the target accompaniment to be finally generated in order to train the decoding neural network.
3) For the unaccompanied multi-track (or single-track) music to be processed, generate a musical accompaniment with the coding neural network and decoding neural network, and synthesize the obtained musical accompaniment and the unaccompanied music into music containing the complete accompaniment.
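For illustration, the generated accompaniment events can be rendered back to a MIDI track and merged with the original file roughly as follows; the (kind, value) event format, the millisecond time grid, and the use of the pretty_midi library are assumptions of this sketch, not details prescribed by the patent.

import pretty_midi

def events_to_midi(events, out_path, program=0, default_velocity=100):
    # Render a decoded MIDI-Like event sequence (SET_VELOCITY / NOTE_ON /
    # TIME_SHIFT / NOTE_OFF pairs) into a single-instrument MIDI file.
    pm = pretty_midi.PrettyMIDI()
    inst = pretty_midi.Instrument(program=program)
    time, velocity, open_notes = 0.0, default_velocity, {}
    for kind, value in events:
        if kind == "SET_VELOCITY":
            velocity = value
        elif kind == "TIME_SHIFT":
            time += value / 1000.0                     # value assumed to be in milliseconds
        elif kind == "NOTE_ON":
            open_notes[value] = (time, velocity)       # remember onset of this pitch
        elif kind == "NOTE_OFF" and value in open_notes:
            start, vel = open_notes.pop(value)
            inst.notes.append(pretty_midi.Note(velocity=vel, pitch=value,
                                               start=start, end=time))
    pm.instruments.append(inst)
    pm.write(out_path)

The accompaniment instrument produced this way can then be appended to the instruments of the original unaccompanied file to obtain music containing the complete accompaniment.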
In one embodiment of the present invention, a process of building an encoding-decoding network structure and a process of preprocessing music data are introduced.
A training process of the encoding neural network and the decoding neural network.
First, a training set of source music is obtained, each piece of music being a training sample; each training sample includes a condition track and a target track, where the condition track is the main track and serves as the model input, and the target track is the accompaniment track and serves as the model output.
Secondly, a coding neural network and a decoding neural network are established, the coding neural network being composed of a plurality of encoders and the decoding neural network of a plurality of decoders; a Transformer encoder and decoder architecture is preferred. However, because a Transformer processes only a fixed-length segment at a time and is not suitable for long music data, the invention builds on the Transformer by storing the information of the previously processed segment, combining it with the current segment information, and processing them together, thereby achieving better fusion of context information.
The source music training set is then processed.
The source music is read and encoded into different music representations (such as REMI representations or pianoroll-based representations) according to its music format. Specifically, for different music formats such as MP3 or WAV, the music is converted into a music sequence by the corresponding music reading technique, and the music sequence is then converted into a coding sequence according to a chosen music coding scheme, such as MIDI-Like or REMI coding. For example, MIDI-Like coding encodes a music sequence into a sequence of events such as SET_VELOCITY, NOTE_ON, TIME_SHIFT, and NOTE_OFF, e.g.: SET_VELOCITY<100>, NOTE_ON<70>, TIME_SHIFT<500>, NOTE_ON<74>, TIME_SHIFT<500>, NOTE_OFF<70>, NOTE_OFF<67> ……
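For illustration, a MIDI file can be converted into such a MIDI-Like event sequence roughly as follows; the use of the pretty_midi library, the millisecond time grid, and the single-track selection are illustrative choices, not details prescribed by the patent.

import pretty_midi

def midi_to_events(path, track_index=0):
    # Convert one track of a MIDI file into a MIDI-Like event sequence of
    # (kind, value) pairs: SET_VELOCITY / NOTE_ON / TIME_SHIFT / NOTE_OFF.
    pm = pretty_midi.PrettyMIDI(path)
    notes = pm.instruments[track_index].notes
    # Build (time, priority, kind, note) points; NOTE_OFF is placed before NOTE_ON at equal times.
    points = [(n.start, 1, "NOTE_ON", n) for n in notes] + \
             [(n.end, 0, "NOTE_OFF", n) for n in notes]
    points.sort(key=lambda p: (p[0], p[1]))
    events, current_time, current_velocity = [], 0.0, None
    for time, _, kind, note in points:
        shift_ms = int(round((time - current_time) * 1000))
        if shift_ms > 0:
            events.append(("TIME_SHIFT", shift_ms))
            current_time = time
        if kind == "NOTE_ON" and note.velocity != current_velocity:
            events.append(("SET_VELOCITY", note.velocity))
            current_velocity = note.velocity
        events.append((kind, note.pitch))
    return events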
The main track sentence (x_1, …, x_M) and the accompaniment track sentence (y_1, …, y_N) of the source music coding representation are then obtained from the source music training set, where x_i denotes the i-th symbol in the main track sentence, M denotes the number of symbols in the main track sentence, y_i denotes the i-th note in the accompaniment track sentence of the source music, and N denotes the number of notes in the accompaniment track; finally, the main track (x_1, …, x_M) and the accompaniment track (y_1, …, y_N) are converted into embedded vectors through word embedding. In practice, because the main track and the accompaniment track are long, the invention equally divides them into a plurality of sections and encodes and decodes only one section at a time, so that the several sections of generated accompaniment are finally combined into the whole accompaniment track.
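The word embedding and the equal division into sections can be sketched as follows; the segment length, vocabulary size, and embedding dimension are assumed hyperparameters rather than values fixed by the patent.

import torch
import torch.nn as nn

def split_into_segments(token_ids, segment_len=512):
    # Equally divide a long symbol sequence into fixed-length segments;
    # the final, shorter remainder is kept as its own segment.
    return [token_ids[i:i + segment_len] for i in range(0, len(token_ids), segment_len)]

# Hypothetical sizes: one embedding table shared by main-track and accompaniment symbols.
embedding = nn.Embedding(num_embeddings=512, embedding_dim=256)

main_track_ids = torch.randint(0, 512, (2048,))                        # (x_1, ..., x_M) as integer ids
segments = split_into_segments(main_track_ids)                         # list of tensors of length <= 512
embedded_segments = [embedding(seg.unsqueeze(0)) for seg in segments]  # each of shape (1, len, 256)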
In one embodiment of the present invention, a training process for an encoding-decoding network architecture is presented.
The embedded vector of the main track is input into the coding neural network for N steps of encoding; during encoding, the hidden-layer sequences computed in the encoder for earlier symbols of the main track sentence are stored in a cache, and the cached data are introduced into the encoding of later symbols; after the encoding is repeated N times, the encoder output is obtained as the encoded music representation of the main track.
The encoded music representation of the main track and the embedded vector of the accompaniment track are taken as input of the decoding neural network for N steps of decoding; during decoding, mask processing is applied to the cross-attention module in the decoder, the hidden-layer sequences computed in the decoder for earlier symbols of the accompaniment track sentence are stored in a cache, and the cached data are introduced into the decoding of later symbols; after the decoding is repeated N times, the output results of all decoders are compared with the accompaniment track, and the coding neural network and decoding neural network are trained until a trained coding-decoding network model is obtained.
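The comparison of the decoder outputs with the ground-truth accompaniment during training can be sketched as a standard teacher-forced cross-entropy update; the model(src, tgt) interface, the padding id, and the optimizer choice are assumptions for illustration, not the patent's exact API.

import torch
import torch.nn as nn

def train_step(model, optimizer, main_ids, accomp_ids, pad_id=0):
    # One training step: the decoder output at each position is compared with
    # the ground-truth accompaniment symbol via cross-entropy (teacher forcing).
    logits = model(main_ids, accomp_ids[:, :-1])          # predict symbols 1..N from symbols 0..N-1
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        accomp_ids[:, 1:].reshape(-1),
        ignore_index=pad_id,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)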
Introducing the cached data into the encoding or decoding of later symbols specifically means:
first, the data in the cache, for which no gradient is computed, are concatenated with the output of the previous hidden layer; the concatenated data serve as the inputs of the K and V channels of the next self-attention layer, while the output of the previous hidden layer alone serves as the input of the Q channel of that layer, yielding the output of the next hidden layer; this is repeated until the N layers of encoding or decoding are completed.
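A minimal sketch of this caching scheme is given below, built on PyTorch's standard multi-head attention; the relative position terms and other layer details are omitted, and the class and argument names are illustrative rather than the patent's implementation.

import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    # Self-attention in which hidden states cached from earlier segments are
    # detached (no gradient) and concatenated with the current segment's hidden
    # states to form K and V, while Q comes from the current segment only.

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, h_current, h_cache=None):
        # h_current: (batch, cur_len, dim); h_cache: (batch, mem_len, dim) or None
        if h_cache is not None:
            context = torch.cat([h_cache.detach(), h_current], dim=1)  # SG(cache) concatenated with current
        else:
            context = h_current
        out, _ = self.attn(query=h_current, key=context, value=context)
        new_cache = h_current.detach()      # stored as the cache for the next segment
        return out, new_cache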
Further, at each step the encoder encodes each symbol in the condition track (the input single-track or multi-track music). During training, the hidden-layer sequences computed for earlier symbols are stored in a cache and kept fixed, and this cached part serves as additional context information. The output of the encoder is used as the input of the decoder. The encoder is formulated as

h^{enc}_i = \mathrm{Encoder}(x_i, M^{enc}_i)

where M^{enc}_i denotes the encoder cache of the i-th encoding; this part of the cache is computed in the previous steps. The specific calculation is as follows: in the training phase, when processing a segment, each hidden layer receives two inputs, namely the output of the previous hidden layer for the current segment and the cached output of the previous hidden layer for the preceding segment. These two inputs are concatenated and then used to calculate the key and value of the current segment, as follows:

\tilde{h}^{n-1}_{\tau+1} = \left[\mathrm{SG}(h^{n-1}_{\tau}) \circ h^{n-1}_{\tau+1}\right]
q^{n}_{\tau+1},\; k^{n}_{\tau+1},\; v^{n}_{\tau+1} = h^{n-1}_{\tau+1} W_q^{\top},\; \tilde{h}^{n-1}_{\tau+1} W_k^{\top},\; \tilde{h}^{n-1}_{\tau+1} W_v^{\top}
h^{n}_{\tau+1} = \mathrm{TransformerLayer}(q^{n}_{\tau+1}, k^{n}_{\tau+1}, v^{n}_{\tau+1})

where τ denotes the τ-th music segment, n denotes the n-th layer of the encoder, h denotes the output of a hidden layer, SG(·) denotes stopping the gradient calculation, W are model parameters, and ∘ is the vector concatenation operation; q^{n}_{\tau+1}, k^{n}_{\tau+1}, v^{n}_{\tau+1} are the query, key, and value vectors in the encoder, h^{n-1}_{\tau+1} is the output of the (τ+1)-th music segment at layer n-1, h^{n-1}_{\tau} is the output of the τ-th segment at layer n-1, \tilde{h}^{n-1}_{\tau+1} is the concatenation result, and h^{n}_{\tau+1} is the output of the (τ+1)-th music segment at layer n.
The encoder adopted in the invention uses relative position encoding, i.e., encoding based on the relative distance between symbols rather than the absolute position used in a standard Transformer. The relative position encoding is calculated as follows:

A^{rel}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}

where E_{x_i} and E_{x_j} are the embedded vectors of the i-th and j-th symbols respectively, R_{i-j} is the relative position vector of the i-th and j-th symbols, W_{k,R}, W_{k,E} and W_q are learnable parameter matrices, u and v are learnable parameter vectors, ⊤ denotes transposition, and A^{rel}_{i,j} is the relative position encoding result of the i-th and j-th symbols.
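A direct, unoptimized sketch of this relative-position computation is given below; the tensor shapes and the indexing of R by the offset i - j are illustrative assumptions.

import torch

def relative_attention_scores(E, R, W_q, W_kE, W_kR, u, v):
    # Compute A_rel[i, j] = E_i^T W_q^T W_kE E_j + E_i^T W_q^T W_kR R_{i-j}
    #                       + u^T W_kE E_j + v^T W_kR R_{i-j} for all pairs (i, j).
    # Assumed shapes: E is (L, d) with one embedded symbol per row, R is (2L-1, d)
    # with R[k + L - 1] holding the relative position vector for offset k = i - j,
    # and W_q, W_kE, W_kR are (d, d); u, v are (d,).
    L, d = E.shape
    Q = E @ W_q.T                                      # rows are W_q E_i
    K_E = E @ W_kE.T                                   # rows are W_kE E_j
    K_R = R @ W_kR.T                                   # rows are W_kR R_k
    idx = torch.arange(L)
    offset = idx[:, None] - idx[None, :] + (L - 1)     # maps (i, j) to the row index of R_{i-j}
    term_a = Q @ K_E.T                                 # content-content term
    term_b = (Q[:, None, :] * K_R[offset]).sum(-1)     # content-position term
    term_c = (K_E @ u)[None, :].expand(L, L)           # global content bias
    term_d = (K_R @ v)[offset]                         # global position bias
    return term_a + term_b + term_c + term_d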
The decoder aims to generate the current symbol from the previously generated symbols and the context semantics provided by the encoder. During training, a mask is applied to the cross-attention module to ensure that each symbol in the decoder sees only the conditional context of the same bar. The decoder is formulated as

\hat{y}_i = \mathrm{Decoder}(y_{<i}, M^{dec}_i, \bar{h}^{enc}_i)

where M^{dec}_i denotes the decoder cache, which is the result of the previous steps of computation, and \bar{h}^{enc}_i denotes all encoder outputs within the same bar. The specific calculation is as follows: in the training phase, when processing a segment, each hidden layer receives two inputs, namely the output of the previous hidden layer for the current segment and the cached output of the previous hidden layer for the preceding segment. These two inputs are concatenated and then used to calculate the key and value of the current segment, as follows:

\tilde{h}^{n-1}_{\tau+1} = \left[\mathrm{SG}(h^{n-1}_{\tau}) \circ h^{n-1}_{\tau+1}\right]
q^{n}_{\tau+1},\; k^{n}_{\tau+1},\; v^{n}_{\tau+1} = h^{n-1}_{\tau+1} W_q^{\top},\; \tilde{h}^{n-1}_{\tau+1} W_k^{\top},\; \tilde{h}^{n-1}_{\tau+1} W_v^{\top}
h^{n}_{\tau+1} = \mathrm{TransformerLayer}(q^{n}_{\tau+1}, k^{n}_{\tau+1}, v^{n}_{\tau+1})

where τ denotes the τ-th music segment, n denotes the n-th layer of the decoder, h denotes the output of a hidden layer, SG(·) denotes stopping the gradient calculation, W are model parameters, and ∘ is the vector concatenation operation; q^{n}_{\tau+1}, k^{n}_{\tau+1}, v^{n}_{\tau+1} are the query, key, and value vectors in the decoder, h^{n-1}_{\tau+1} is the output of the (τ+1)-th music segment at layer n-1, h^{n-1}_{\tau} is the output of the τ-th segment at layer n-1, \tilde{h}^{n-1}_{\tau+1} is the concatenation result, and h^{n}_{\tau+1} is the output of the (τ+1)-th music segment at layer n.
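The bar-level masking of the cross-attention module described above can be sketched as follows; how bar indices are obtained for each symbol is an assumption about the preprocessing, and the boolean convention follows PyTorch's attention modules.

import torch

def bar_level_cross_attention_mask(decoder_bar_ids, encoder_bar_ids):
    # Build a boolean mask for cross-attention so that each decoder symbol attends
    # only to encoder (condition) symbols belonging to the same bar.
    # decoder_bar_ids and encoder_bar_ids are 1-D integer tensors giving the bar
    # index of every symbol. Returns a (tgt_len, src_len) tensor where True marks
    # positions that must be masked out (blocked), matching PyTorch's attn_mask convention.
    same_bar = decoder_bar_ids[:, None] == encoder_bar_ids[None, :]
    return ~same_bar

# Usage sketch: pass the result as the memory_mask / attn_mask of the cross-attention
# module so that attention weights outside the current bar are suppressed.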
The decoder likewise uses relative position encoding, i.e., encoding based on the relative distance between symbols rather than the absolute position used in a standard Transformer. The relative position encoding is calculated in the same way as in the encoder:

A^{rel}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}

where E_{x_i} and E_{x_j} are the embedded vectors of symbols i and j, R_{i-j} is the relative position vector, and the other parameters are learnable vectors or matrices.
In another embodiment of the present invention, a music accompaniment automatic generation system based on an encoding-decoding network is provided.
The system specifically comprises:
A music sample collection module: used for acquiring music data as the source music training set or as the unaccompanied music to be processed, wherein the source music training set comprises condition tracks and target tracks, the condition tracks are labeled as main tracks, and the target tracks are labeled as accompaniment tracks.
Encoding-decoding network module: the system is provided with an encoding neural network and a decoding neural network, wherein the encoding neural network is composed of a plurality of encoders, and the decoding neural network is composed of a plurality of decoders.
The music format preprocessing module: different music preprocessing modes are configured, and the corresponding preprocessing modes are selected by recognizing different music formats.
A cache module: for buffering the hidden layer output information of each encoder and decoder.
A training module: for updating the parameters of the coding-decoding network module during the training phase.
The music synthesis module: the music accompaniment and the unaccompanied music are synthesized into music containing complete accompaniment and output.
Wherein, the music format preprocessing module comprises:
an initial characterization generation module: for reading source music and encoding into different initial music representations.
A word embedding module: for converting the statements of the main track and the accompaniment track into an embedded vector.
The coding neural network in the coding-decoding network module is composed of a plurality of coders, in the coding process, a hidden layer sequence obtained by calculating a symbol positioned in front in a main track sentence in the coders is stored in a cache module, and data in the cache is introduced into the coding process of a symbol behind, and coding is repeated for N times.
The decoding neural network in the coding-decoding network module is composed of a plurality of decoders, in the decoding process, mask processing is carried out on a cross attention module in the decoders, a hidden layer sequence obtained by calculating a symbol positioned in the front in an accompaniment track sentence in the decoder is stored in a cache module, data in the cache is introduced into the decoding process of a symbol positioned in the rear, and decoding is repeated for N times.
The modules are communicatively connected through interfaces, which may be electrical or of other forms. The division into modules is not limited to the specific embodiment provided herein.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
To further demonstrate the effectiveness of the present invention, it was experimentally verified on the LMD data set, which contains 21916 music pieces and 372339 bars with a total duration of 255.13 hours. The music coding adopted by the invention is MIDI-Like coding based on event sequences. To verify the effectiveness of the invention, several evaluation indexes were designed for the experiments; the objective evaluation comprises chord accuracy (CA), the average pitch overlapped area (D_p), the average volume overlapped area (D_v), and the average duration overlapped area (D_d).
TABLE 1 Experimental results
Method                  CA          D_p         D_d
The present invention   0.45±0.01   0.58±0.01   0.55±0.01
MuseGAN                 0.37±0.02   0.21±0.01   0.35±0.01
Table 1 shows the evaluation results of the present invention, where MuseGAN is a piano-roll-based method. It can be seen that the results of the present invention are substantially better than those of MuseGAN, indicating that the creative application of the coding-decoding structure to the field of automatic music accompaniment is highly effective.
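The overlapped-area indexes such as D_p and D_d compare the distribution of an attribute (e.g. pitch or duration) in the generated accompaniment with that of a reference. A minimal sketch of such a metric is given below; the histogram binning and normalization are assumptions, since the patent does not give the exact definitions.

import numpy as np

def overlapped_area(values_a, values_b, bins=32, value_range=None):
    # Overlapped area of two normalized histograms, illustrating metrics such as
    # the pitch (D_p) or duration (D_d) overlap between generated and reference tracks.
    hist_a, edges = np.histogram(values_a, bins=bins, range=value_range)
    hist_b, _ = np.histogram(values_b, bins=edges)
    hist_a = hist_a / max(hist_a.sum(), 1)
    hist_b = hist_b / max(hist_b.sum(), 1)
    return float(np.minimum(hist_a, hist_b).sum())

# Example: overlapped_area(generated_pitches, reference_pitches, bins=128, value_range=(0, 127))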

Claims (6)

1. A music accompaniment automatic generation method based on a coding-decoding network, characterized by comprising the following steps:
1) acquiring a source music training set, wherein each piece of music is used as a training sample, and the training samples are marked as a main track and an accompaniment track;
2) establishing a coding-decoding network structure which comprises a coding neural network and a decoding neural network, wherein the coding neural network is composed of a plurality of coders, and the decoding neural network is composed of a plurality of decoders; the encoder and the decoder adopt the structures of the encoder and the decoder in a Transformer;
3) reading source music and coding the source music into different initial music representations according to different music formats to obtain source music coding representations; then, acquiring a main track statement and an accompaniment track statement from the source music coding representation; finally, converting the main track sentences and the accompaniment track sentences into embedded vectors through word embedding;
4) inputting the embedded vector of the main audio track into a neural network of an encoder to carry out N-step encoding, storing a hidden layer sequence obtained by calculating a symbol positioned in front in a main audio track sentence in the encoder into a cache in the encoding process, introducing data in the cache into the encoding process of a symbol behind, firstly connecting data in the cache without gradient calculation with the output of a previous hidden layer, respectively taking the connected data as the input of a K channel and a V channel in a next self-attention layer, and taking the output of the previous hidden layer as the input of a Q channel in the next self-attention layer to obtain the output of the next hidden layer; repeating the steps until N layers of coding are completed; after repeating the encoding for N times, obtaining the output of the encoder as the music representation after the main track encoding;
the method comprises the steps that a music representation after main track coding and an embedded vector of an accompaniment track are used as input of a decoding neural network to carry out N-step decoding, in the decoding process, mask processing is carried out on a cross attention module in a decoder, a hidden layer sequence obtained by calculation of a symbol positioned in front in an accompaniment track sentence in the decoder is stored in a cache, data in the cache is introduced into the decoding process of a symbol behind, firstly, data in the cache without gradient calculation are connected with output of a previous hidden layer, the connected data are respectively used as input of a K channel and a V channel in a next layer of self-attention layer, the output of the previous layer of hidden layer is used as input of a Q channel in the next layer of self-attention layer, and output of the next layer of hidden layer is obtained; repeating the steps until the decoding of the N layers is completed; after repeating decoding for N times, comparing output results of all decoders with accompaniment tracks, and training a coding neural network and a decoding neural network until a trained coding-decoding network model is obtained;
5) acquiring an embedded vector of a music main audio track aiming at the music without accompaniment to be processed, taking the embedded vector as the input of a trained coding-decoding network model, and outputting the music accompaniment; and synthesizing the obtained musical accompaniment and the unaccompanied music into music containing complete accompaniment.
2. The method according to claim 1, wherein the initial music representation obtained in step 3) is obtained by: firstly, according to different music formats, a corresponding music reading technology is adopted to obtain a music sequence, and then MIDI or REMI coding is adopted to obtain an encoded music sequence which is used as an initial music representation.
3. The method as claimed in claim 1, wherein the encoded neural network employs a relative position encoding method.
4. The method of claim 3, wherein the calculation formula of the relative position coding mode is:

A^{rel}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}

wherein E_{x_i} and E_{x_j} are respectively the embedded vectors of the i-th and j-th symbols, R_{i-j} is the relative position vector of the i-th and j-th symbols, W_{k,R}, W_{k,E} and W_q are parameter matrices, u and v are parameter vectors, ⊤ denotes transposition, and A^{rel}_{i,j} is the relative position coding result of the i-th and j-th symbols.
5. An automatic generation system of musical accompaniment for implementing the method of claim 1, comprising:
music sample collection module: the music data acquisition device is used for acquiring music data as a source music training set or to-be-processed accompaniment-free music, wherein the source music training set comprises a condition audio track and a target audio track, the condition audio track is marked as a main audio track, and the target audio track is marked as an accompaniment audio track;
encoding-decoding network module: configuring a coding neural network and a decoding neural network, wherein the coding neural network is composed of a plurality of encoders, in the coding process, a hidden layer sequence obtained by calculating a symbol positioned in front in a main track sentence in the encoders is stored in a cache module, and data in the cache is introduced into the coding process of a symbol behind, and the coding is repeated for N times; the decoding neural network is composed of a plurality of decoders, in the decoding process, mask processing is carried out on a cross attention module in the decoders, a hidden layer sequence obtained by calculating a symbol positioned in front in an accompaniment track sentence in the decoder is stored in a cache module, data in the cache is introduced into the decoding process of a symbol behind, and decoding is repeated for N times;
the music format preprocessing module: configuring different music preprocessing modes, and selecting the corresponding preprocessing modes by identifying different music formats;
a cache module: for caching hidden layer output information of each encoder and decoder;
a training module: for updating the parameters of the coding-decoding network module during the training phase;
the music synthesis module: the music accompaniment and the unaccompanied music are synthesized into music containing complete accompaniment and output.
6. The system for automatic generation of musical accompaniment according to claim 5, wherein said music format preprocessing module comprises:
an initial characterization generation module: for reading source music and encoding into different initial music representations;
a word embedding module: for converting the statements of the main track and the accompaniment track into an embedded vector.
CN202010795908.1A 2020-08-10 2020-08-10 Music accompaniment automatic generation method and system based on coding-decoding network Active CN111653256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010795908.1A CN111653256B (en) 2020-08-10 2020-08-10 Music accompaniment automatic generation method and system based on coding-decoding network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010795908.1A CN111653256B (en) 2020-08-10 2020-08-10 Music accompaniment automatic generation method and system based on coding-decoding network

Publications (2)

Publication Number Publication Date
CN111653256A CN111653256A (en) 2020-09-11
CN111653256B true CN111653256B (en) 2020-12-08

Family

ID=72350277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010795908.1A Active CN111653256B (en) 2020-08-10 2020-08-10 Music accompaniment automatic generation method and system based on coding-decoding network

Country Status (1)

Country Link
CN (1) CN111653256B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528631B (en) * 2020-12-03 2022-08-09 上海谷均教育科技有限公司 Intelligent accompaniment system based on deep learning algorithm
CN113223482A (en) * 2021-04-07 2021-08-06 北京脑陆科技有限公司 Music generation method and system based on neural network
CN114171053B (en) * 2021-12-20 2024-04-05 Oppo广东移动通信有限公司 Training method of neural network, audio separation method, device and equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231844B (en) * 2008-02-28 2012-04-18 北京中星微电子有限公司 System and method of mobile phone ring mixing
JP6047985B2 (en) * 2012-07-31 2016-12-21 ヤマハ株式会社 Accompaniment progression generator and program
US8847056B2 (en) * 2012-10-19 2014-09-30 Sing Trix Llc Vocal processing with accompaniment music input
CN111091800B (en) * 2019-12-25 2022-09-16 北京百度网讯科技有限公司 Song generation method and device
CN111161695B (en) * 2019-12-26 2022-11-04 北京百度网讯科技有限公司 Song generation method and device

Also Published As

Publication number Publication date
CN111653256A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
Dhariwal et al. Jukebox: A generative model for music
CN111653256B (en) Music accompaniment automatic generation method and system based on coding-decoding network
Mor et al. A universal music translation network
Roberts et al. Hierarchical variational autoencoders for music
US5930754A (en) Method, device and article of manufacture for neural-network based orthography-phonetics transformation
CN111554255B (en) MIDI playing style automatic conversion system based on recurrent neural network
Kim et al. Korean singing voice synthesis system based on an LSTM recurrent neural network
Wang et al. A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $ F_0 $ Model for Statistical Parametric Speech Synthesis
CN111583891B (en) Automatic musical note vector composing system and method based on context information
Wang et al. PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network
CN113327627B (en) Multi-factor controllable voice conversion method and system based on feature decoupling
Lin et al. A unified model for zero-shot music source separation, transcription and synthesis
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN110459201B (en) Speech synthesis method for generating new tone
CN113035228A (en) Acoustic feature extraction method, device, equipment and storage medium
Shin et al. Text-driven emotional style control and cross-speaker style transfer in neural tts
Zhang et al. AccentSpeech: learning accent from crowd-sourced data for target speaker TTS with accents
Sajad et al. Music generation for novices using Recurrent Neural Network (RNN)
ES2366551T3 (en) CODING AND DECODING DEPENDENT ON A SOURCE OF MULTIPLE CODE BOOKS.
Maduskar et al. Music generation using deep generative modelling
Cooper et al. Text-to-speech synthesis techniques for MIDI-to-audio synthesis
CN116386575A (en) Music generation method, device, electronic equipment and storage medium
CN112820266B (en) Parallel end-to-end speech synthesis method based on skip encoder
Tomczak et al. Drum translation for timbral and rhythmic transformation
CN113299268A (en) Speech synthesis method based on stream generation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant