CN111653256B - Music accompaniment automatic generation method and system based on coding-decoding network - Google Patents

Music accompaniment automatic generation method and system based on coding-decoding network

Info

Publication number
CN111653256B
CN111653256B CN202010795908.1A CN202010795908A
Authority
CN
China
Prior art keywords
music
coding
accompaniment
track
decoding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010795908.1A
Other languages
Chinese (zh)
Other versions
CN111653256A (en)
Inventor
赵洲
何金铮
任意
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202010795908.1A priority Critical patent/CN111653256B/en
Publication of CN111653256A publication Critical patent/CN111653256A/en
Application granted granted Critical
Publication of CN111653256B publication Critical patent/CN111653256B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G10H1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0033 Recording/reproducing or transmission of music for electrophonic musical instruments
    • G10H1/0041 Recording/reproducing or transmission of music for electrophonic musical instruments in coded form
    • G10H1/0058 Transmission between separate instruments or between individual components of a musical system
    • G10H1/0066 Transmission between separate instruments or between individual components of a musical system using a MIDI interface
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101 Music Composition or musical creation; Tools or processes therefor
    • G10H2210/111 Automatic composing, i.e. using predefined musical rules
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00 Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/011 Files or data streams containing coded musical information, e.g. for transmission
    • G10H2240/046 File format, i.e. specific or non-standard musical file format used in or adapted for electrophonic musical instruments, e.g. in wavetables

Abstract

The invention discloses a music accompaniment automatic generation method and system based on a coding-decoding network, belonging to the field of music design. The method mainly comprises the following steps: 1) for given original music data, training a coding neural network formed by multiple encoders to encode and learn the characteristics of multi-track (or single-track) music; 2) after encoding is finished, training a decoding neural network formed by multiple decoders on the multi-track (or single-track) music features output by the encoders and on the music to be finally generated. Compared with traditional methods based on sequence models, the invention uses a long-distance cache mechanism to better model long pieces of music, and uses the coding-decoding network to achieve a faster training process. The effect achieved by the present invention on the problem of multi-track (or single-track) musical accompaniment is better than that of conventional methods.

Description

Music accompaniment automatic generation method and system based on coding-decoding network
Technical Field
The invention relates to the field of music design, in particular to a music accompaniment automatic generation method and a music accompaniment automatic generation system based on a coding-decoding network.
Background
Music is an artistic form and cultural activity whose medium is sound organized in time. The general definition of music includes common elements such as pitch (which governs melody and harmony), rhythm (and its associated concepts tempo, meter, and articulation), dynamics (loudness and softness), and the sonic qualities of timbre and texture (sometimes referred to as the "color" of a musical sound). Different styles or types of music may emphasize, de-emphasize, or ignore certain of these elements. Music is performed with a variety of instruments and techniques. Accompaniment is a challenging problem in many areas of music technology, and automatic accompaniment techniques aim to generate accompaniment tracks for a given main track. With the rapid development of artificial intelligence and deep learning, many automatic accompaniment methods have been proposed.
An automatic accompaniment technique based on generative adversarial networks has been proposed in the prior art. The method treats MIDI as a piano-roll, with time on the horizontal axis and pitch on the vertical axis, and uses 0 and 1 to indicate whether a note is triggered at the current position and pitch. It introduces the generative adversarial network techniques widely used in the image field into the music generation process and addresses the musical accompaniment problem by designing a multi-track model and a temporal model: the multi-track model is responsible for the interdependencies among tracks, while the temporal model handles the temporal dependencies. However, piano-roll data are sparse, training is unstable, and experimental results also show that the quality of the musical accompaniment generated by this method is limited.
Meanwhile, in the field of automatic composition, several sequence-model-based methods have also been proposed; however, they adopt only a single decoder to decode a single track, so they can only realize continuation of single-track music and cannot realize automatic accompaniment.
In summary, current mainstream automatic accompaniment and automatic composition techniques cannot meet the requirement of generating high-quality multi-track automatic accompaniment, which remains a major challenge in the field of automatic accompaniment.
Disclosure of Invention
The present invention aims to solve the problems of the prior art. To overcome the inability of existing methods to generate high-quality automatic accompaniment, the present invention provides a music accompaniment automatic generation method and system based on a coding-decoding network.
The invention discloses a music accompaniment automatic generation method based on a coding-decoding network, which comprises the following steps.
1) A training set of source music is obtained; each piece of music serves as a training sample, and the tracks of each sample are labeled as a main track and accompaniment tracks.
2) A coding-decoding network structure is established, comprising a coding neural network and a decoding neural network, wherein the coding neural network is formed by a plurality of encoders and the decoding neural network is formed by a plurality of decoders.
3) The source music is read and encoded into different initial music representations according to its music format, yielding a source music coding representation; the main track sentence (x_1, …, x_M) and the accompaniment track sentence (y_1, …, y_N) are then obtained from the source music coding representation, where x_i denotes the i-th symbol in the main track sentence, M denotes the number of symbols in the main track sentence, y_i denotes the i-th note in the accompaniment track sentence of the source music, and N denotes the number of notes in the accompaniment track; finally, the main track (x_1, …, x_M) and the accompaniment track (y_1, …, y_N) are converted into embedded vectors through word embedding.
4) The embedded vector of the main track is input into the coding neural network for N steps of encoding; during encoding, the hidden-layer sequences computed in the encoder for earlier symbols of the main track sentence are stored in a cache, and the cached data are introduced into the encoding of later symbols; after the encoding is repeated N times, the encoder output is obtained as the encoded music representation of the main track.
The encoded music representation of the main track and the embedded vector of the accompaniment track are taken as input of the decoding neural network for N steps of decoding; during decoding, mask processing is applied to the cross-attention module in the decoder, the hidden-layer sequences computed in the decoder for earlier symbols of the accompaniment track sentence are stored in a cache, and the cached data are introduced into the decoding of later symbols; after the decoding is repeated N times, the output results of all decoders are compared with the accompaniment track, and the coding neural network and decoding neural network are trained until a trained coding-decoding network model is obtained.
5) For the unaccompanied music to be processed, the embedded vector of its main track is obtained and fed as input to the trained coding-decoding network model, which outputs the musical accompaniment; the obtained musical accompaniment and the unaccompanied music are then synthesized into music containing the complete accompaniment.
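As an illustration of step 5), a minimal sketch of the accompaniment generation loop at inference time is shown below; the encode/decode_step interface, greedy decoding, and the BOS/EOS symbols are assumptions for illustration rather than details fixed by the patent.

import torch

@torch.no_grad()
def generate_accompaniment(model, main_track_ids, bos_id, eos_id, max_len=1024):
    # Autoregressively decode accompaniment symbols for an unaccompanied main track.
    # `model` is assumed to expose `encode(src)` and `decode_step(memory, prefix)`
    # methods returning encoder states and next-symbol logits; these names are
    # illustrative and not part of the patent.
    memory = model.encode(main_track_ids)              # encoded main-track representation
    generated = [bos_id]
    for _ in range(max_len):
        logits = model.decode_step(memory, torch.tensor([generated]))
        next_id = int(logits[0, -1].argmax())          # greedy choice; sampling is also possible
        if next_id == eos_id:
            break
        generated.append(next_id)
    return generated[1:]                               # accompaniment symbol sequence

The returned symbol sequence is then converted back into notes and merged with the unaccompanied music by the synthesis step.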
Another object of the present invention is to provide an automatic musical accompaniment generation system for implementing the above method.
The system comprises:
A music sample collection module: used for acquiring music data as the source music training set or as the unaccompanied music to be processed, wherein the source music training set comprises condition tracks and target tracks, the condition tracks are labeled as main tracks, and the target tracks are labeled as accompaniment tracks.
Encoding-decoding network module: the system is provided with an encoding neural network and a decoding neural network, wherein the encoding neural network is composed of a plurality of encoders, and the decoding neural network is composed of a plurality of decoders.
The music format preprocessing module: different music preprocessing modes are configured, and the corresponding preprocessing modes are selected by recognizing different music formats.
A cache module: for buffering the hidden layer output information of each encoder and decoder.
A training module: for updating the parameters of the coding-decoding network module during the training phase.
The music synthesis module: the music accompaniment and the unaccompanied music are synthesized into music containing complete accompaniment and output.
Compared with the prior art, the invention has the following beneficial effects.
(1) The invention creatively applies the coding-decoding structure to the field of automatic music accompaniment. Compared with traditional piano-roll-based methods, the sequence model adopted here represents music data more effectively; training of piano-roll-based methods is very unstable, whereas training of the invention is stable and generalizes better.
(2) Sequence models applied to this field in the prior art adopt only a decoder, can decode only one track, and can therefore only realize continuation of single-track music. The invention adopts a coding-decoding structure and uses the encoder to encode the main track, so that the context information of the main track of the music can be modeled better, ensuring the performance of automatic accompaniment.
(3) The encoder and decoder adopted by the invention are implemented on the basis of the conventional Transformer model; however, a standard Transformer processes only a fixed-length segment at a time, which is unsuitable for encoding and decoding music tracks. The invention stores the information of previously processed segments, combines it with the information of the current segment, and processes them together. In this way, the context information of the track is fused, the model can handle longer segments, the method is suitable for longer music data, and the fused context information improves processing accuracy.
(4) The invention employs a bar-level attention mechanism: since notes within a bar tend to be highly correlated, masking is applied to the cross-attention module during training to ensure that each symbol in the decoder sees only the conditional context of the same bar, preventing irrelevant information from affecting the decoding process.
Drawings
Fig. 1 is a schematic diagram of an encoding-decoding network employed by the present invention.
Detailed Description
The invention will be further elucidated and described with reference to the drawings and the detailed description.
As shown in Fig. 1, the music accompaniment automatic generation method based on an encoding-decoding network used by the present invention includes the following steps.
1) For the input multi-track (or single-track) music, train a coding neural network composed of multiple encoders.
2) For the input multi-track (or single-track) music, obtain the output of the coding neural network; combine this output with the target accompaniment to be finally generated in order to train the decoding neural network.
3) For the unaccompanied multi-track (or single-track) music to be processed, generate a musical accompaniment with the coding neural network and decoding neural network, and synthesize the obtained musical accompaniment and the unaccompanied music into music containing the complete accompaniment.
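For illustration, the generated accompaniment events can be rendered back to a MIDI track and merged with the original file roughly as follows; the (kind, value) event format, the millisecond time grid, and the use of the pretty_midi library are assumptions of this sketch, not details prescribed by the patent.

import pretty_midi

def events_to_midi(events, out_path, program=0, default_velocity=100):
    # Render a decoded MIDI-Like event sequence (SET_VELOCITY / NOTE_ON /
    # TIME_SHIFT / NOTE_OFF pairs) into a single-instrument MIDI file.
    pm = pretty_midi.PrettyMIDI()
    inst = pretty_midi.Instrument(program=program)
    time, velocity, open_notes = 0.0, default_velocity, {}
    for kind, value in events:
        if kind == "SET_VELOCITY":
            velocity = value
        elif kind == "TIME_SHIFT":
            time += value / 1000.0                     # value assumed to be in milliseconds
        elif kind == "NOTE_ON":
            open_notes[value] = (time, velocity)       # remember onset of this pitch
        elif kind == "NOTE_OFF" and value in open_notes:
            start, vel = open_notes.pop(value)
            inst.notes.append(pretty_midi.Note(velocity=vel, pitch=value,
                                               start=start, end=time))
    pm.instruments.append(inst)
    pm.write(out_path)

The accompaniment instrument produced this way can then be appended to the instruments of the original unaccompanied file to obtain music containing the complete accompaniment.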
In one embodiment of the present invention, a process of building an encoding-decoding network structure and a process of preprocessing music data are introduced.
A training process of the encoding neural network and the decoding neural network.
First, a training set of source music is obtained, each piece of music being a training sample; each training sample includes a condition track and a target track, where the condition track is the main track and serves as the model input, and the target track is the accompaniment track and serves as the model output.
Secondly, a coding neural network and a decoding neural network are established, the coding neural network being composed of a plurality of encoders and the decoding neural network of a plurality of decoders; a Transformer encoder and decoder architecture is preferred. However, because a Transformer processes only a fixed-length segment at a time and is not suitable for long music data, the invention builds on the Transformer by storing the information of the previously processed segment, combining it with the current segment information, and processing them together, thereby achieving better fusion of context information.
The source music training set is then processed.
The source music is read and encoded into different music representations (such as REMI representations or pianoroll-based representations) according to its music format. Specifically, for different music formats such as MP3 or WAV, the music is converted into a music sequence by the corresponding music reading technique, and the music sequence is then converted into a coding sequence according to a chosen music coding scheme, such as MIDI-Like or REMI coding. For example, MIDI-Like coding encodes a music sequence into a sequence of events such as SET_VELOCITY, NOTE_ON, TIME_SHIFT, and NOTE_OFF, e.g.: SET_VELOCITY<100>, NOTE_ON<70>, TIME_SHIFT<500>, NOTE_ON<74>, TIME_SHIFT<500>, NOTE_OFF<70>, NOTE_OFF<67> ……
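For illustration, a MIDI file can be converted into such a MIDI-Like event sequence roughly as follows; the use of the pretty_midi library, the millisecond time grid, and the single-track selection are illustrative choices, not details prescribed by the patent.

import pretty_midi

def midi_to_events(path, track_index=0):
    # Convert one track of a MIDI file into a MIDI-Like event sequence of
    # (kind, value) pairs: SET_VELOCITY / NOTE_ON / TIME_SHIFT / NOTE_OFF.
    pm = pretty_midi.PrettyMIDI(path)
    notes = pm.instruments[track_index].notes
    # Build (time, priority, kind, note) points; NOTE_OFF is placed before NOTE_ON at equal times.
    points = [(n.start, 1, "NOTE_ON", n) for n in notes] + \
             [(n.end, 0, "NOTE_OFF", n) for n in notes]
    points.sort(key=lambda p: (p[0], p[1]))
    events, current_time, current_velocity = [], 0.0, None
    for time, _, kind, note in points:
        shift_ms = int(round((time - current_time) * 1000))
        if shift_ms > 0:
            events.append(("TIME_SHIFT", shift_ms))
            current_time = time
        if kind == "NOTE_ON" and note.velocity != current_velocity:
            events.append(("SET_VELOCITY", note.velocity))
            current_velocity = note.velocity
        events.append((kind, note.pitch))
    return events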
The main track sentence (x_1, …, x_M) and the accompaniment track sentence (y_1, …, y_N) of the source music coding representation are then obtained from the source music training set, where x_i denotes the i-th symbol in the main track sentence, M denotes the number of symbols in the main track sentence, y_i denotes the i-th note in the accompaniment track sentence of the source music, and N denotes the number of notes in the accompaniment track; finally, the main track (x_1, …, x_M) and the accompaniment track (y_1, …, y_N) are converted into embedded vectors through word embedding. In practice, because the main track and the accompaniment track are long, the invention equally divides them into a plurality of sections and encodes and decodes only one section at a time, so that the several sections of generated accompaniment are finally combined into the whole accompaniment track.
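The word embedding and the equal division into sections can be sketched as follows; the segment length, vocabulary size, and embedding dimension are assumed hyperparameters rather than values fixed by the patent.

import torch
import torch.nn as nn

def split_into_segments(token_ids, segment_len=512):
    # Equally divide a long symbol sequence into fixed-length segments;
    # the final, shorter remainder is kept as its own segment.
    return [token_ids[i:i + segment_len] for i in range(0, len(token_ids), segment_len)]

# Hypothetical sizes: one embedding table shared by main-track and accompaniment symbols.
embedding = nn.Embedding(num_embeddings=512, embedding_dim=256)

main_track_ids = torch.randint(0, 512, (2048,))                        # (x_1, ..., x_M) as integer ids
segments = split_into_segments(main_track_ids)                         # list of tensors of length <= 512
embedded_segments = [embedding(seg.unsqueeze(0)) for seg in segments]  # each of shape (1, len, 256)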
In one embodiment of the present invention, a training process for an encoding-decoding network architecture is presented.
The embedded vector of the main track is input into the coding neural network for N steps of encoding; during encoding, the hidden-layer sequences computed in the encoder for earlier symbols of the main track sentence are stored in a cache, and the cached data are introduced into the encoding of later symbols; after the encoding is repeated N times, the encoder output is obtained as the encoded music representation of the main track.
The encoded music representation of the main track and the embedded vector of the accompaniment track are taken as input of the decoding neural network for N steps of decoding; during decoding, mask processing is applied to the cross-attention module in the decoder, the hidden-layer sequences computed in the decoder for earlier symbols of the accompaniment track sentence are stored in a cache, and the cached data are introduced into the decoding of later symbols; after the decoding is repeated N times, the output results of all decoders are compared with the accompaniment track, and the coding neural network and decoding neural network are trained until a trained coding-decoding network model is obtained.
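The comparison of the decoder outputs with the ground-truth accompaniment during training can be sketched as a standard teacher-forced cross-entropy update; the model(src, tgt) interface, the padding id, and the optimizer choice are assumptions for illustration, not the patent's exact API.

import torch
import torch.nn as nn

def train_step(model, optimizer, main_ids, accomp_ids, pad_id=0):
    # One training step: the decoder output at each position is compared with
    # the ground-truth accompaniment symbol via cross-entropy (teacher forcing).
    logits = model(main_ids, accomp_ids[:, :-1])          # predict symbols 1..N from symbols 0..N-1
    loss = nn.functional.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        accomp_ids[:, 1:].reshape(-1),
        ignore_index=pad_id,
    )
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(loss)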
Introducing the cached data into the encoding or decoding of later symbols specifically means:
first, the data in the cache, for which no gradient is computed, are concatenated with the output of the previous hidden layer; the concatenated data serve as the inputs of the K and V channels of the next self-attention layer, while the output of the previous hidden layer alone serves as the input of the Q channel of that layer, yielding the output of the next hidden layer; this is repeated until the N layers of encoding or decoding are completed.
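A minimal sketch of this caching scheme is given below, built on PyTorch's standard multi-head attention; the relative position terms and other layer details are omitted, and the class and argument names are illustrative rather than the patent's implementation.

import torch
import torch.nn as nn

class CachedSelfAttention(nn.Module):
    # Self-attention in which hidden states cached from earlier segments are
    # detached (no gradient) and concatenated with the current segment's hidden
    # states to form K and V, while Q comes from the current segment only.

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, h_current, h_cache=None):
        # h_current: (batch, cur_len, dim); h_cache: (batch, mem_len, dim) or None
        if h_cache is not None:
            context = torch.cat([h_cache.detach(), h_current], dim=1)  # SG(cache) concatenated with current
        else:
            context = h_current
        out, _ = self.attn(query=h_current, key=context, value=context)
        new_cache = h_current.detach()      # stored as the cache for the next segment
        return out, new_cache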
Further, at each step the encoder encodes each symbol in the condition track (the input single-track or multi-track music). During training, the hidden-layer sequences computed for earlier symbols are stored in a cache and kept fixed, and this cached part serves as additional context information. The output of the encoder is used as the input of the decoder. The encoder is formulated as

h^{enc}_i = \mathrm{Encoder}(x_i, M^{enc}_i)

where M^{enc}_i denotes the encoder cache of the i-th encoding; this part of the cache is computed in the previous steps. The specific calculation is as follows: in the training phase, when processing a segment, each hidden layer receives two inputs, namely the output of the previous hidden layer for the current segment and the cached output of the previous hidden layer for the preceding segment. These two inputs are concatenated and then used to calculate the key and value of the current segment, as follows:

\tilde{h}^{n-1}_{\tau+1} = \left[\mathrm{SG}(h^{n-1}_{\tau}) \circ h^{n-1}_{\tau+1}\right]
q^{n}_{\tau+1},\; k^{n}_{\tau+1},\; v^{n}_{\tau+1} = h^{n-1}_{\tau+1} W_q^{\top},\; \tilde{h}^{n-1}_{\tau+1} W_k^{\top},\; \tilde{h}^{n-1}_{\tau+1} W_v^{\top}
h^{n}_{\tau+1} = \mathrm{TransformerLayer}(q^{n}_{\tau+1}, k^{n}_{\tau+1}, v^{n}_{\tau+1})

where τ denotes the τ-th music segment, n denotes the n-th layer of the encoder, h denotes the output of a hidden layer, SG(·) denotes stopping the gradient calculation, W are model parameters, and ∘ is the vector concatenation operation; q^{n}_{\tau+1}, k^{n}_{\tau+1}, v^{n}_{\tau+1} are the query, key, and value vectors in the encoder, h^{n-1}_{\tau+1} is the output of the (τ+1)-th music segment at layer n-1, h^{n-1}_{\tau} is the output of the τ-th segment at layer n-1, \tilde{h}^{n-1}_{\tau+1} is the concatenation result, and h^{n}_{\tau+1} is the output of the (τ+1)-th music segment at layer n.
The encoder adopted in the invention uses relative position encoding, i.e., encoding based on the relative distance between symbols rather than the absolute position used in a standard Transformer. The relative position encoding is calculated as follows:

A^{rel}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}

where E_{x_i} and E_{x_j} are the embedded vectors of the i-th and j-th symbols respectively, R_{i-j} is the relative position vector of the i-th and j-th symbols, W_{k,R}, W_{k,E} and W_q are learnable parameter matrices, u and v are learnable parameter vectors, ⊤ denotes transposition, and A^{rel}_{i,j} is the relative position encoding result of the i-th and j-th symbols.
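A direct, unoptimized sketch of this relative-position computation is given below; the tensor shapes and the indexing of R by the offset i - j are illustrative assumptions.

import torch

def relative_attention_scores(E, R, W_q, W_kE, W_kR, u, v):
    # Compute A_rel[i, j] = E_i^T W_q^T W_kE E_j + E_i^T W_q^T W_kR R_{i-j}
    #                       + u^T W_kE E_j + v^T W_kR R_{i-j} for all pairs (i, j).
    # Assumed shapes: E is (L, d) with one embedded symbol per row, R is (2L-1, d)
    # with R[k + L - 1] holding the relative position vector for offset k = i - j,
    # and W_q, W_kE, W_kR are (d, d); u, v are (d,).
    L, d = E.shape
    Q = E @ W_q.T                                      # rows are W_q E_i
    K_E = E @ W_kE.T                                   # rows are W_kE E_j
    K_R = R @ W_kR.T                                   # rows are W_kR R_k
    idx = torch.arange(L)
    offset = idx[:, None] - idx[None, :] + (L - 1)     # maps (i, j) to the row index of R_{i-j}
    term_a = Q @ K_E.T                                 # content-content term
    term_b = (Q[:, None, :] * K_R[offset]).sum(-1)     # content-position term
    term_c = (K_E @ u)[None, :].expand(L, L)           # global content bias
    term_d = (K_R @ v)[offset]                         # global position bias
    return term_a + term_b + term_c + term_d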
The decoder aims to generate the current symbol from the previously generated symbols and the context semantics provided by the encoder. During training, a mask is applied to the cross-attention module to ensure that each symbol in the decoder sees only the conditional context of the same bar. The decoder is formulated as

\hat{y}_i = \mathrm{Decoder}(y_{<i}, M^{dec}_i, \bar{h}^{enc}_i)

where M^{dec}_i denotes the decoder cache, which is the result of the previous steps of computation, and \bar{h}^{enc}_i denotes all encoder outputs within the same bar. The specific calculation is as follows: in the training phase, when processing a segment, each hidden layer receives two inputs, namely the output of the previous hidden layer for the current segment and the cached output of the previous hidden layer for the preceding segment. These two inputs are concatenated and then used to calculate the key and value of the current segment, as follows:

\tilde{h}^{n-1}_{\tau+1} = \left[\mathrm{SG}(h^{n-1}_{\tau}) \circ h^{n-1}_{\tau+1}\right]
q^{n}_{\tau+1},\; k^{n}_{\tau+1},\; v^{n}_{\tau+1} = h^{n-1}_{\tau+1} W_q^{\top},\; \tilde{h}^{n-1}_{\tau+1} W_k^{\top},\; \tilde{h}^{n-1}_{\tau+1} W_v^{\top}
h^{n}_{\tau+1} = \mathrm{TransformerLayer}(q^{n}_{\tau+1}, k^{n}_{\tau+1}, v^{n}_{\tau+1})

where τ denotes the τ-th music segment, n denotes the n-th layer of the decoder, h denotes the output of a hidden layer, SG(·) denotes stopping the gradient calculation, W are model parameters, and ∘ is the vector concatenation operation; q^{n}_{\tau+1}, k^{n}_{\tau+1}, v^{n}_{\tau+1} are the query, key, and value vectors in the decoder, h^{n-1}_{\tau+1} is the output of the (τ+1)-th music segment at layer n-1, h^{n-1}_{\tau} is the output of the τ-th segment at layer n-1, \tilde{h}^{n-1}_{\tau+1} is the concatenation result, and h^{n}_{\tau+1} is the output of the (τ+1)-th music segment at layer n.
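The bar-level masking of the cross-attention module described above can be sketched as follows; how bar indices are obtained for each symbol is an assumption about the preprocessing, and the boolean convention follows PyTorch's attention modules.

import torch

def bar_level_cross_attention_mask(decoder_bar_ids, encoder_bar_ids):
    # Build a boolean mask for cross-attention so that each decoder symbol attends
    # only to encoder (condition) symbols belonging to the same bar.
    # decoder_bar_ids and encoder_bar_ids are 1-D integer tensors giving the bar
    # index of every symbol. Returns a (tgt_len, src_len) tensor where True marks
    # positions that must be masked out (blocked), matching PyTorch's attn_mask convention.
    same_bar = decoder_bar_ids[:, None] == encoder_bar_ids[None, :]
    return ~same_bar

# Usage sketch: pass the result as the memory_mask / attn_mask of the cross-attention
# module so that attention weights outside the current bar are suppressed.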
The decoder likewise uses relative position encoding, i.e., encoding based on the relative distance between symbols rather than the absolute position used in a standard Transformer. The relative position encoding is calculated in the same way as in the encoder:

A^{rel}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}

where E_{x_i} and E_{x_j} are the embedded vectors of symbols i and j, R_{i-j} is the relative position vector, and the other parameters are learnable vectors or matrices.
In another embodiment of the present invention, a music accompaniment automatic generation system based on an encoding-decoding network is provided.
The system specifically comprises:
A music sample collection module: used for acquiring music data as the source music training set or as the unaccompanied music to be processed, wherein the source music training set comprises condition tracks and target tracks, the condition tracks are labeled as main tracks, and the target tracks are labeled as accompaniment tracks.
Encoding-decoding network module: the system is provided with an encoding neural network and a decoding neural network, wherein the encoding neural network is composed of a plurality of encoders, and the decoding neural network is composed of a plurality of decoders.
The music format preprocessing module: different music preprocessing modes are configured, and the corresponding preprocessing modes are selected by recognizing different music formats.
A cache module: for buffering the hidden layer output information of each encoder and decoder.
A training module: for updating the parameters of the coding-decoding network module during the training phase.
The music synthesis module: the music accompaniment and the unaccompanied music are synthesized into music containing complete accompaniment and output.
Wherein, the music format preprocessing module comprises:
an initial characterization generation module: for reading source music and encoding into different initial music representations.
A word embedding module: for converting the statements of the main track and the accompaniment track into an embedded vector.
The coding neural network in the coding-decoding network module is composed of a plurality of coders, in the coding process, a hidden layer sequence obtained by calculating a symbol positioned in front in a main track sentence in the coders is stored in a cache module, and data in the cache is introduced into the coding process of a symbol behind, and coding is repeated for N times.
The decoding neural network in the coding-decoding network module is composed of a plurality of decoders, in the decoding process, mask processing is carried out on a cross attention module in the decoders, a hidden layer sequence obtained by calculating a symbol positioned in the front in an accompaniment track sentence in the decoder is stored in a cache module, data in the cache is introduced into the decoding process of a symbol positioned in the rear, and decoding is repeated for N times.
The modules are communicatively connected through interfaces, which may be electrical or of other forms. The division into modules is not limited to the specific embodiment provided herein.
The method is applied to the following embodiments to achieve the technical effects of the present invention, and detailed steps in the embodiments are not described again.
Examples
To further demonstrate the effectiveness of the present invention, it was experimentally verified on the LMD data set, which contains 21916 music pieces and 372339 bars with a total duration of 255.13 hours. The music coding adopted by the invention is MIDI-Like coding based on event sequences. To verify the effectiveness of the invention, several evaluation indexes were designed for the experiments; the objective evaluation comprises chord accuracy (CA), the average pitch overlapped area (D_p), the average volume overlapped area (D_v), and the average duration overlapped area (D_d).
TABLE 1 Experimental results
Method                  CA          D_p         D_d
The present invention   0.45±0.01   0.58±0.01   0.55±0.01
MuseGAN                 0.37±0.02   0.21±0.01   0.35±0.01
Table 1 shows the evaluation results of the present invention, where MuseGAN is a piano-roll-based method. It can be seen that the results of the present invention are substantially better than those of MuseGAN, indicating that the creative application of the coding-decoding structure to the field of automatic music accompaniment is highly effective.
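The overlapped-area indexes such as D_p and D_d compare the distribution of an attribute (e.g. pitch or duration) in the generated accompaniment with that of a reference. A minimal sketch of such a metric is given below; the histogram binning and normalization are assumptions, since the patent does not give the exact definitions.

import numpy as np

def overlapped_area(values_a, values_b, bins=32, value_range=None):
    # Overlapped area of two normalized histograms, illustrating metrics such as
    # the pitch (D_p) or duration (D_d) overlap between generated and reference tracks.
    hist_a, edges = np.histogram(values_a, bins=bins, range=value_range)
    hist_b, _ = np.histogram(values_b, bins=edges)
    hist_a = hist_a / max(hist_a.sum(), 1)
    hist_b = hist_b / max(hist_b.sum(), 1)
    return float(np.minimum(hist_a, hist_b).sum())

# Example: overlapped_area(generated_pitches, reference_pitches, bins=128, value_range=(0, 127))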

Claims (6)

1. A music accompaniment automatic generation method based on a coding-decoding network, characterized by comprising the following steps:
1) acquiring a source music training set, wherein each piece of music is used as a training sample, and the training samples are marked as a main track and an accompaniment track;
2) establishing a coding-decoding network structure which comprises a coding neural network and a decoding neural network, wherein the coding neural network is composed of a plurality of coders, and the decoding neural network is composed of a plurality of decoders; the encoder and the decoder adopt the structures of the encoder and the decoder in a Transformer;
3) reading source music and coding the source music into different initial music representations according to different music formats to obtain source music coding representations; then, acquiring a main track statement and an accompaniment track statement from the source music coding representation; finally, converting the main track sentences and the accompaniment track sentences into embedded vectors through word embedding;
4) inputting the embedded vector of the main audio track into a neural network of an encoder to carry out N-step encoding, storing a hidden layer sequence obtained by calculating a symbol positioned in front in a main audio track sentence in the encoder into a cache in the encoding process, introducing data in the cache into the encoding process of a symbol behind, firstly connecting data in the cache without gradient calculation with the output of a previous hidden layer, respectively taking the connected data as the input of a K channel and a V channel in a next self-attention layer, and taking the output of the previous hidden layer as the input of a Q channel in the next self-attention layer to obtain the output of the next hidden layer; repeating the steps until N layers of coding are completed; after repeating the encoding for N times, obtaining the output of the encoder as the music representation after the main track encoding;
the method comprises the steps that a music representation after main track coding and an embedded vector of an accompaniment track are used as input of a decoding neural network to carry out N-step decoding, in the decoding process, mask processing is carried out on a cross attention module in a decoder, a hidden layer sequence obtained by calculation of a symbol positioned in front in an accompaniment track sentence in the decoder is stored in a cache, data in the cache is introduced into the decoding process of a symbol behind, firstly, data in the cache without gradient calculation are connected with output of a previous hidden layer, the connected data are respectively used as input of a K channel and a V channel in a next layer of self-attention layer, the output of the previous layer of hidden layer is used as input of a Q channel in the next layer of self-attention layer, and output of the next layer of hidden layer is obtained; repeating the steps until the decoding of the N layers is completed; after repeating decoding for N times, comparing output results of all decoders with accompaniment tracks, and training a coding neural network and a decoding neural network until a trained coding-decoding network model is obtained;
5) acquiring an embedded vector of a music main audio track aiming at the music without accompaniment to be processed, taking the embedded vector as the input of a trained coding-decoding network model, and outputting the music accompaniment; and synthesizing the obtained musical accompaniment and the unaccompanied music into music containing complete accompaniment.
2. The method according to claim 1, wherein the initial music representation obtained in step 3) is obtained by: firstly, according to different music formats, a corresponding music reading technology is adopted to obtain a music sequence, and then MIDI or REMI coding is adopted to obtain an encoded music sequence which is used as an initial music representation.
3. The method as claimed in claim 1, wherein the encoded neural network employs a relative position encoding method.
4. The method of claim 3, wherein the calculation formula of the relative position coding mode is:

A^{rel}_{i,j} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}

wherein E_{x_i} and E_{x_j} are respectively the embedded vectors of the i-th and j-th symbols, R_{i-j} is the relative position vector of the i-th and j-th symbols, W_{k,R}, W_{k,E} and W_q are parameter matrices, u and v are parameter vectors, ⊤ denotes transposition, and A^{rel}_{i,j} is the relative position coding result of the i-th and j-th symbols.
5. An automatic generation system of musical accompaniment for implementing the method of claim 1, comprising:
music sample collection module: the music data acquisition device is used for acquiring music data as a source music training set or to-be-processed accompaniment-free music, wherein the source music training set comprises a condition audio track and a target audio track, the condition audio track is marked as a main audio track, and the target audio track is marked as an accompaniment audio track;
encoding-decoding network module: configuring a coding neural network and a decoding neural network, wherein the coding neural network is composed of a plurality of encoders, in the coding process, a hidden layer sequence obtained by calculating a symbol positioned in front in a main track sentence in the encoders is stored in a cache module, and data in the cache is introduced into the coding process of a symbol behind, and the coding is repeated for N times; the decoding neural network is composed of a plurality of decoders, in the decoding process, mask processing is carried out on a cross attention module in the decoders, a hidden layer sequence obtained by calculating a symbol positioned in front in an accompaniment track sentence in the decoder is stored in a cache module, data in the cache is introduced into the decoding process of a symbol behind, and decoding is repeated for N times;
the music format preprocessing module: configuring different music preprocessing modes, and selecting the corresponding preprocessing modes by identifying different music formats;
a cache module: for caching hidden layer output information of each encoder and decoder;
a training module: for updating the parameters of the coding-decoding network module during the training phase;
the music synthesis module: the music accompaniment and the unaccompanied music are synthesized into music containing complete accompaniment and output.
6. The system for automatic generation of musical accompaniment according to claim 5, wherein said music format preprocessing module comprises:
an initial characterization generation module: for reading source music and encoding into different initial music representations;
a word embedding module: for converting the statements of the main track and the accompaniment track into an embedded vector.
CN202010795908.1A 2020-08-10 2020-08-10 Music accompaniment automatic generation method and system based on coding-decoding network Active CN111653256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010795908.1A CN111653256B (en) 2020-08-10 2020-08-10 Music accompaniment automatic generation method and system based on coding-decoding network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010795908.1A CN111653256B (en) 2020-08-10 2020-08-10 Music accompaniment automatic generation method and system based on coding-decoding network

Publications (2)

Publication Number Publication Date
CN111653256A CN111653256A (en) 2020-09-11
CN111653256B true CN111653256B (en) 2020-12-08

Family

ID=72350277

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010795908.1A Active CN111653256B (en) 2020-08-10 2020-08-10 Music accompaniment automatic generation method and system based on coding-decoding network

Country Status (1)

Country Link
CN (1) CN111653256B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112528631B (en) * 2020-12-03 2022-08-09 上海谷均教育科技有限公司 Intelligent accompaniment system based on deep learning algorithm
CN113223482A (en) * 2021-04-07 2021-08-06 北京脑陆科技有限公司 Music generation method and system based on neural network
CN114171053B (en) * 2021-12-20 2024-04-05 Oppo广东移动通信有限公司 Training method of neural network, audio separation method, device and equipment

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101231844B (en) * 2008-02-28 2012-04-18 北京中星微电子有限公司 System and method of mobile phone ring mixing
JP6047985B2 (en) * 2012-07-31 2016-12-21 ヤマハ株式会社 Accompaniment progression generator and program
US8847056B2 (en) * 2012-10-19 2014-09-30 Sing Trix Llc Vocal processing with accompaniment music input
CN111091800B (en) * 2019-12-25 2022-09-16 北京百度网讯科技有限公司 Song generation method and device
CN111161695B (en) * 2019-12-26 2022-11-04 北京百度网讯科技有限公司 Song generation method and device

Also Published As

Publication number Publication date
CN111653256A (en) 2020-09-11

Similar Documents

Publication Publication Date Title
Dhariwal et al. Jukebox: A generative model for music
CN111653256B (en) Music accompaniment automatic generation method and system based on coding-decoding network
Mor et al. A universal music translation network
Roberts et al. Hierarchical variational autoencoders for music
US5930754A (en) Method, device and article of manufacture for neural-network based orthography-phonetics transformation
CN111554255B (en) MIDI playing style automatic conversion system based on recurrent neural network
Kim et al. Korean singing voice synthesis system based on an LSTM recurrent neural network
Wang et al. A Vector Quantized Variational Autoencoder (VQ-VAE) Autoregressive Neural $ F_0 $ Model for Statistical Parametric Speech Synthesis
CN111583891B (en) Automatic musical note vector composing system and method based on context information
Wang et al. PerformanceNet: Score-to-audio music generation with multi-band convolutional residual network
CN113327627B (en) Multi-factor controllable voice conversion method and system based on feature decoupling
Lin et al. A unified model for zero-shot music source separation, transcription and synthesis
CN114242033A (en) Speech synthesis method, apparatus, device, storage medium and program product
CN110459201B (en) Speech synthesis method for generating new tone
CN113035228A (en) Acoustic feature extraction method, device, equipment and storage medium
Shin et al. Text-driven emotional style control and cross-speaker style transfer in neural tts
Zhang et al. AccentSpeech: learning accent from crowd-sourced data for target speaker TTS with accents
Sajad et al. Music generation for novices using Recurrent Neural Network (RNN)
ES2366551T3 (en) CODING AND DECODING DEPENDENT ON A SOURCE OF MULTIPLE CODE BOOKS.
Maduskar et al. Music generation using deep generative modelling
Cooper et al. Text-to-speech synthesis techniques for MIDI-to-audio synthesis
CN116386575A (en) Music generation method, device, electronic equipment and storage medium
CN112820266B (en) Parallel end-to-end speech synthesis method based on skip encoder
Tomczak et al. Drum translation for timbral and rhythmic transformation
CN113299268A (en) Speech synthesis method based on stream generation model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant