WO2021236209A1 - Automatic music generation - Google Patents

Automatic music generation

Info

Publication number
WO2021236209A1
Authority
WO
WIPO (PCT)
Prior art keywords
note, sequence, time, music, transformer
Prior art date
Application number
PCT/US2021/022025
Other languages
English (en)
French (fr)
Inventor
Xianchao WU
Chengyuan Wang
Qinying LEI
Peijun XIA
Yuanchun XU
Original Assignee
Microsoft Technology Licensing, Llc
Priority date
Filing date
Publication date
Application filed by Microsoft Technology Licensing, LLC
Publication of WO2021236209A1 publication Critical patent/WO2021236209A1/en


Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00 Details of electrophonic musical instruments
    • G10H1/0008 Associated control or indicating means
    • G10H1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G10H2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101 Music Composition or musical creation; Tools or processes therefor
    • G10H2210/111 Automatic composing, i.e. using predefined musical rules
    • G10H2210/131 Morphing, i.e. transformation of a musical piece into a new different one, e.g. remix
    • G10H2250/00 Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311 Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Definitions

  • Embodiments of the present disclosure propose methods and apparatuses for automatic music generation.
  • An initial sequence may be obtained.
  • a time-valued note sequence may be generated through a transformer network.
  • the time-valued note sequence may be converted to music content.
  • FIG.1 illustrates an exemplary process of automatic music generation according to an embodiment.
  • FIG.2 illustrates an example of time value-based note representation according to an embodiment.
  • FIG.3 illustrates an exemplary architecture of a Transformer-XL.
  • FIG.4 illustrates an exemplary architecture of a transformer network according to an embodiment
  • FIG.5 illustrates an exemplary architecture of a transformer network according to an embodiment.
  • FIG.6 illustrates an exemplary process of updating music content according to an embodiment.
  • FIG.7 illustrates a flow of an exemplary method for automatic music generation according to an embodiment.
  • FIG.8 illustrates an exemplary apparatus for automatic music generation according to an embodiment.
  • FIG.9 illustrates an exemplary apparatus for automatic music generation according to an embodiment.

DETAILED DESCRIPTION
  • the present disclosure will now be discussed with reference to several example implementations. It is to be understood that these implementations are discussed only for enabling those skilled in the art to better understand and thus implement the embodiments of the present disclosure, rather than suggesting any limitations on the scope of the present disclosure.
  • the mainstream approaches include modeling note sequences by borrowing the idea of language modeling in the field of natural language processing, to predict a note sequence during a music generation process.
  • DeepJ is proposed for style-specific music generation, which may train a bidirectional long short term memory (LSTM) network with piano roll note representations for music styles such as the Baroque period, the Classical period, the Romantic period, etc. DeepJ can create music conditioned on a specific mixture of composition styles.
  • piano roll representations of notes are used as dense representations of MIDI music. It is assumed that a piece of music is an N × T matrix, wherein N is the number of playable notes, and T is the number of time steps.
  • note-playing and note- replaying are jointly used to define a note representation, wherein the note-replaying refers to replaying a note immediately after the note ends, with no time step between successive playing.
  • piano roll is not dense actually, because there are a large number of zeros in the playing/replaying matrix, and during each time step only a few notes are played and all other notes are zeros.
  • Transformer technique has also been proposed for music generation, wherein the transformer is a sequence model that is based on self-attention mechanism, which has good performance in tasks that require maintaining long-range coherence.
  • Compared with a recurrent neural network (RNN), the transformer is more parallelizable and more interpretable for both training and inferring stages.
  • Music Transformer is proposed for adopting a transformer with a relative attention mechanism for generating piano music.
  • the Music Transformer is trained by adopting a single note sequence with time intervals. It is difficult to use note representations that are based on time or time intervals for calculating the similarity of notes that share the same time value but have different tempos in one or more MIDI files.
  • the existing Transformer-XL for music generation is still trained by adopting a single note sequence with time intervals, thus facing similar restrictions as the Music Transformer described above.
  • Automatic music generation proposed by the embodiments of the present disclosure may quickly and automatically create high-quality music and generate corresponding music content.
  • the music content may broadly refer to various instantiated presentations of music, e.g., MIDI file, music score, etc.
  • the embodiments of the present disclosure may be applied to automatically generate various other types of music content in similar approaches.
  • the embodiments of the present disclosure propose time value-based note representation.
  • the time value-based note representation is a type of relative note representation, which measures a note in a relative length, rather than representing a note in time.
  • a time-valued note sequence may be adopted in the automatic music generation.
  • Each note in the time-valued note sequence is represented by a 4-tuple (four-tuple).
  • the 4-tuple may comprise: a time value from the former note on to the current note on; a time value from the current note on to the current note off; a pitch of the current note; and a velocity of the current note.
  • the time-valued note sequence may be further divided into four sequences: a sequence corresponding to the time value from the former note on to the current note on, simplified as an On2On sequence; a sequence corresponding to the time value from the current note on to the current note off, simplified as an On2Off sequence; a sequence corresponding to the pitch of the current note, simplified as a pitch sequence; and a sequence corresponding to the velocity of the current note, simplified as a velocity sequence.
  • the embodiments of the present disclosure propose a transformer network constructed based at least on Transformer-XL, for predicting time-valued note sequences.
  • the transformer network may comprise four transformer sub-networks constructed based on Transformer-XL respectively.
  • the four transformer sub-networks may correspond to a On2On sequence, a On2Off sequence, a pitch sequence, and a velocity sequence included in a time-valued note sequence respectively, and may be jointly trained based on these four sequences.
  • the embodiments of the present disclosure may perform music generation under generation conditions specified by a user.
  • the user may specify a desired emotion and/or music style for music to be generated.
  • the transformer network may predict a time-valued note sequence in consideration of the emotion and/or music style.
  • the user may provide an indication of music parameters, so that when the predicted time-valued note sequence is converted into music content, at least the music parameters specified by the user may be considered.
  • the embodiments of the present disclosure propose an updating mechanism for the generated music content. Since humans compose music based on time values of notes, and the embodiments of the present disclosure also employ time value- based note representations, it is easy to understand and modify the generated music content.
  • the generated music content may be post-edited or post-created by a user through a specific music content editing platform, thereby implementing the updating of the music content.
  • the embodiments of the present disclosure may quickly and automatically generate music content with higher quality and longer duration. Rhythm of the generated music is stable and rich. Note density of the generated music is stable over a long period of time without significant decay over time, wherein the note density may refer to, e.g., the number of notes within each time window of a given size.
  • FIG.1 illustrates an exemplary process 100 of automatic music generation according to an embodiment.
  • An initial sequence 102 may be obtained first.
  • the initial sequence 102 is a seed sequence used to trigger automatic music generation, which may take the form of time-valued note sequence.
  • the initial sequence 102 may be a short segment of time-valued note sequence as the beginning of the music to be generated.
  • the initial sequence 102 may represent notes based on the 4-tuple as described above.
  • the initial sequence may be obtained in various approaches.
  • the initial sequence 102 may be randomly generated. In one implementation, the designated initial sequence 102 may be received directly from, e.g., a user. In one implementation, the initial sequence 102 may be obtained indirectly. For example, assuming that a music segment is obtained and it is desired to use the music segment as the beginning of the generated music, after receiving the music segment, a time-valued note sequence may be extracted from the music segment as the initial sequence 102. The initial sequence 102 may be provided to a transformer network 110 to trigger prediction of a time-valued note sequence.
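  • As one possible illustration of the "randomly generated" option above, the following Python sketch builds a short seed sequence of 4-tuples; the value sets, ranges, and helper name are illustrative assumptions and are not prescribed by the disclosure.

```python
import random

# Hypothetical vocabularies; the disclosure does not fix these value sets.
TIME_VALUES = [0.0, 0.125, 0.25, 0.5, 0.75, 1.0]   # ratios relative to a whole note
PITCH_RANGE = range(48, 84)                         # MIDI pitch numbers
VELOCITY_RANGE = range(40, 110)

def random_initial_sequence(num_notes=4, seed=None):
    """Randomly generate a short time-valued note sequence (list of 4-tuples).

    Each note is (on2on, on2off, pitch, velocity), as described in the disclosure.
    """
    rng = random.Random(seed)
    notes = []
    for i in range(num_notes):
        on2on = 0.0 if i == 0 else rng.choice(TIME_VALUES[1:])  # first note starts immediately
        on2off = rng.choice(TIME_VALUES[1:])
        notes.append((on2on, on2off, rng.choice(PITCH_RANGE), rng.choice(VELOCITY_RANGE)))
    return notes

seed_sequence = random_initial_sequence(num_notes=4, seed=0)
```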
  • the transformer network 110 may include four transformer sub-networks corresponding to 4-tuples used to represent notes, e.g., a sub-network for processing an On2On sequence, a sub-network for processing an On2Off sequence, a sub-network for processing a pitch sequence, a sub-network for processing a velocity sequence, etc. Each sub-network may be constructed based on Transformer-XL.
  • the transformer network 110 may iteratively predict the next note based on the current time-valued note sequence.
  • the transformer network 110 may also predict the time-valued note sequence in consideration of emotion and/or music style. For example, an indication of emotion and/or music style 104 may be received.
  • the emotion may refer to a type of emotion that is expected to be expressed by the generated music, e.g., happy, sad, etc.
  • the music style may refer to a style to which the generated music belongs. There may be various classifications of music style in the field of music, e.g., Baroque, Classical, Romantic, Country, jazz, etc. The embodiments of the present disclosure are applicable to any music style.
  • the emotion and/or music style 104 may be converted into a corresponding latent spatial representation, e.g., an emotion embedding representation and/or style embedding representation 106, and the emotion embedding representation and/or style embedding representation 106 are provided to the transformer network 110.
  • the transformer network 110 may take the emotion embedding representation and/or style embedding representation 106 as an additional input in addition to the current time-valued note sequence.
  • the process 100 may further convert the time-valued note sequence 120 into music content 140.
  • the conversion may be performed by a converting module 130.
  • the converting module 130 may operate based on a predetermined rule 132.
  • the predetermined rule 132 includes a mapping relationship from a time-valued note sequence to a specific type of music content.
  • the predetermined rule 132 may include a predetermined mapping relationship from a time-valued note sequence to a MIDI file.
  • the converting module 130 may map information read from the time-valued note sequence 120 to information in a MIDI file based on the predetermined rule 132, thereby finally outputting the MIDI file.
  • Different types of music content may correspond to different predetermined rules, and the embodiments of the present disclosure are not restricted by any specific predetermined rules.
  • the converting module 130 may also perform conversion with reference to music parameters 108.
  • the music parameters 108 may include various parameters involved in music creation, e.g., tempo, metre, length, etc.
  • the tempo may refer to the rate of beats, which may be denoted as a bpm value, i.e., the number of beats per minute.
  • the metre may refer to a fixed repetitive sequence of strong and weak sounds, which may be denoted by, e.g., a metre number, such as, 2/4, 4/4, etc.
  • the length may refer to the desired total length of music. It should be understood that only a few examples of music parameters are listed above, and the embodiments of the present disclosure are also applicable to any other music parameters.
  • the music parameters 108 may be specified by a user according to the user's own preferences, or set by default.
  • the process 100 may receive an indication of the music parameters and provide it to the converting module 130.
  • the converting module 130 may make reference to the music parameters 108 when converting the time-valued note sequence 120 into the music content 140. For example, assuming that the music parameters specify a metre of 4/4, the converting module 130 may convert the time-valued note sequence 120 into, e.g., a MIDI file according to the metre "4/4".
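  • As a minimal sketch of one possible predetermined rule for this conversion, the following Python code maps a time-valued note sequence to a MIDI file using the pretty_midi library; the function name, the default tempo, the single-instrument assumption, and applying the metre only as a time-signature event are illustrative choices and not the disclosure's converting module 130 itself.

```python
import pretty_midi

def time_valued_to_midi(notes, tempo=120, metre=(4, 4), out_path="generated.mid"):
    """Convert a time-valued note sequence (list of (on2on, on2off, pitch, velocity)
    4-tuples, with time values as ratios of a whole note) into a MIDI file."""
    whole_note_seconds = (60.0 / tempo) * 4            # a whole note = 4 beats at the given bpm
    pm = pretty_midi.PrettyMIDI(initial_tempo=tempo)
    pm.time_signature_changes.append(pretty_midi.TimeSignature(metre[0], metre[1], 0))
    piano = pretty_midi.Instrument(program=0)          # acoustic grand piano
    onset = 0.0
    for on2on, on2off, pitch, velocity in notes:
        onset += on2on * whole_note_seconds            # time from the former note on to this note on
        end = onset + on2off * whole_note_seconds      # time from this note on to its note off
        piano.notes.append(pretty_midi.Note(velocity=int(velocity), pitch=int(pitch),
                                            start=onset, end=end))
    pm.instruments.append(piano)
    pm.write(out_path)
    return pm

# Example usage with an assumed 4/4 metre at 120 bpm:
# time_valued_to_midi(seed_sequence, tempo=120, metre=(4, 4), out_path="demo.mid")
```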
  • the above process 100 is only exemplary, and depending on specific application requirements and designs, any form of modification may be made to the process 100 and all these modifications will be covered by the present disclosure.
  • FIG.2 illustrates an example of time value-based note representation according to an embodiment.
  • An exemplary music segment is shown in the diagram 200.
  • the x-axis denotes time and the y-axis denotes pitch.
  • This music segment is a C major chord arpeggiated with the sustain pedal active, with a starting velocity of 80.
  • a note 202, a note 204, and a note 206 are sequentially played at second 0, second 0.5, and second 1.0, wherein a pitch of the note 202 is 60, a pitch of the note 204 is 64, and a pitch of the note 206 is 67.
  • At second 2.0, the sustain pedal is released, and the note 202, the note 204, and the note 206 end.
  • FIG.2 shows a note sequence 210 that represents the above music segment in the approach of time shift-based note representation in Music Transformer.
  • Four performance events for note representation are defined in Music Transformer, including: "SET_VELOCITY”, which denotes the setting of the velocity of the note; "NOTE_ON”, which denotes the start of the note; "TIME_SHIFT”, which denotes the time interval from the previous event to the next event; and "NOTE_OFF”, which denotes the end of the note. Therefore, the note sequence 210 may also be regarded as a sequence of performance events.
  • The events "TIME_SHIFT<1000>, NOTE_OFF<60>, NOTE_OFF<64>, NOTE_OFF<67>" in the note sequence 210 represent that: at 1000 milliseconds after the previous event "NOTE_ON<67>" (i.e., at second 2.0), the note 202 with the pitch of "60", the note 204 with the pitch of "64", and the note 206 with the pitch of "67" all end.
  • The events "TIME_SHIFT<500>, SET_VELOCITY<100>, NOTE_ON<65>" in the note sequence 210 represent that: at 500 milliseconds after the previous event "NOTE_OFF<67>" (i.e., at second 2.5), a velocity of "100" is set for the subsequent notes, and the note 208 with a pitch of "65" starts.
  • “TIME_SHIFT” adopted in Music Transformer may cause the loss of time value information.
  • FIG.2 shows a time-valued note sequence 220 that represents the music segment in the approach of time value-based note representation according to the embodiments of the present disclosure.
  • each note is represented by a 4-tuple.
  • a 4-tuple “(0, 1.0, 60, 80)” represents the note 202
  • a 4-tuple “(0.25, 0.75, 64, 80)” represents the note 204
  • a 4-tuple “(0.25, 0.5, 67, 80)” represents the note 206
  • a four-tuple “(0.75, 0.25, 65, 100)” represents the note 208.
  • a time value may be a ratio of a target object to a whole note, the target object being, e.g., "from the former note on to the current note on", "from the current note on to the current note off", etc.
  • a time value may be calculated as: time value = (tempo × duration) / (60 × 4)
  • the tempo denotes a bpm value
  • the duration denotes how many seconds the target object lasted
  • 60 denotes 60 seconds which is for converting a tempo measured in minutes to a tempo measured in seconds
  • "divided by 4" is for converting into a ratio relative to a whole note.
  • a time-valued note sequence 230 for representing the music segment is also shown in FIG.2, which is a variant of the time-valued note sequence 220.
  • the time-valued note sequence 220 employs floating-point representations of time values, e.g., 1.0, 0.25, etc., while the time-valued note sequence 230 employs integer representations of time values.
  • An integer representation of a time value may be obtained through multiplying a floating-point representation of the time value by an integerization multiple.
  • an integerization multiple of 384 may be calculated, accordingly, the time value "1.0" in the time-valued note sequence 220 may be converted to a time value "384" in the time-valued note sequence 230, the time value "0.25" in the time-valued note sequence 220 may be converted into a time value "96" in the time-valued note sequence 230, etc.
  • the efficiency of data calculation and processing may be further improved.
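  • A small numeric check of the above formula and the integerization step, assuming (consistently with the values in FIG.2) that the music segment is played at 120 bpm; the function names are illustrative only:

```python
def time_value(duration_seconds, tempo_bpm):
    """time value = (tempo * duration) / (60 * 4): a ratio relative to a whole note."""
    return (tempo_bpm * duration_seconds) / (60 * 4)

def integerize(value, multiple=384):
    """Integer representation of a time value, using the integerization multiple."""
    return round(value * multiple)

# At 120 bpm a whole note lasts 2.0 seconds, so:
assert time_value(2.0, 120) == 1.0        # On2Off of note 202
assert time_value(0.5, 120) == 0.25       # On2On of note 204
assert integerize(1.0) == 384             # "1.0" -> "384" in the sequence 230
assert integerize(0.25) == 96             # "0.25" -> "96" in the sequence 230
```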
  • the time-valued note sequences 220 and 230 in FIG.2 may be further divided or converted into an On2On sequence, an On2Off sequence, a pitch sequence and a velocity sequence.
  • the On2On sequence may be formed by the first item in a 4-tuple of each note in the time-valued note sequence
  • the On2Off sequence may be formed by the second item in a 4-tuple of each note in the time-valued note sequence
  • the pitch sequence may be formed by the third item in a 4-tuple of each note in the time-valued note sequence
  • the velocity sequence may be formed by the fourth item in a 4-tuple of each note in the time-valued note sequence.
  • the time-valued note sequence 220 may be divided into an On2On sequence {0, 0.25, 0.25, 0.75}, an On2Off sequence {1.0, 0.75, 0.5, 0.25}, a pitch sequence {60, 64, 67, 65}, and a velocity sequence {80, 80, 80, 100}.
  • the order of the four items in the 4-tuples listed above is exemplary, and the embodiments of the present disclosure may cover any other arranging order of the four items.
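  • A minimal sketch of this division, using the 4-tuples of the time-valued note sequence 220; the helper name is an assumption:

```python
notes_220 = [(0.0, 1.0, 60, 80), (0.25, 0.75, 64, 80),
             (0.25, 0.5, 67, 80), (0.75, 0.25, 65, 100)]

def split_sequences(notes):
    """Divide a time-valued note sequence into On2On, On2Off, pitch, and velocity sequences."""
    on2on, on2off, pitch, velocity = (list(column) for column in zip(*notes))
    return on2on, on2off, pitch, velocity

on2on, on2off, pitch, velocity = split_sequences(notes_220)
# on2on    == [0.0, 0.25, 0.25, 0.75]
# on2off   == [1.0, 0.75, 0.5, 0.25]
# pitch    == [60, 64, 67, 65]
# velocity == [80, 80, 80, 100]
```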
  • the transformer network according to the embodiments of the present disclosure may be constructed based in part on the known Transformer-XL.
  • the Transformer-XL may capture long-term dependency and resolve the problem of language or music context fragmentation through segment-level recurrence and relative position encoding.
  • Assume that $S_{\tau-1} = [x_{\tau-1,1}, \ldots, x_{\tau-1,L}]$ is a segment with a length of L, e.g., L words in natural language or L notes in music, and that $h_{\tau-1}^{n-1} \in \mathbb{R}^{L \times d}$ is the (n-1)-th layer hidden state sequence corresponding to $S_{\tau-1}$, wherein d is the dimension of the hidden layers and $n \leq N$.
  • The hidden state sequence of the n-th layer for the next segment $S_{\tau}$ may then be calculated as:

    $\tilde{h}_{\tau}^{n-1} = \left[ \mathrm{SG}(h_{\tau-1}^{n-1}) \circ h_{\tau}^{n-1} \right]$    Equation (1)

    $q_{\tau}^{n}, k_{\tau}^{n}, v_{\tau}^{n} = h_{\tau}^{n-1} W_{q}^{\top}, \tilde{h}_{\tau}^{n-1} W_{k}^{\top}, \tilde{h}_{\tau}^{n-1} W_{v}^{\top}$    Equation (2)

    $h_{\tau}^{n} = \text{Transformer-Layer}(q_{\tau}^{n}, k_{\tau}^{n}, v_{\tau}^{n})$    Equation (3)
  • SG(·) denotes stop gradient, i.e., the gradient of $h_{\tau-1}^{n-1}$ will not be updated based on the next segment
  • $[h_u \circ h_v]$ denotes a concatenation of two hidden sequences along the length dimension
  • W denotes trainable model parameters
  • Transformer-Layer(·) denotes processing through the layers in the transformer.
  • The main update here is utilizing the hidden state sequence $h_{\tau-1}^{n-1}$ in the (n-1)-th layer of the former segment for calculating the intermediate sequence $\tilde{h}_{\tau}^{n-1}$, and further utilizing $\tilde{h}_{\tau}^{n-1}$ for calculating the extended context-enhanced key and value sequences $k_{\tau}^{n}$ and $v_{\tau}^{n}$ to be retrieved with the query sequence $q_{\tau}^{n}$.
  • The recurrent mechanism applied by the Transformer-XL shown in the above Equations is similar to $h_{\tau}^{n} = \text{Recurrent}(h_{\tau-1}^{n-1}, h_{\tau}^{n-1})$, which is performed for every two consecutive segments, wherein Recurrent(·) denotes the recurrent mechanism. This actually creates a segment-level recurrence in the hidden states of the various transformer layers. Thus, the context information is allowed to be utilized beyond the two segments.
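  • A minimal PyTorch sketch of the segment-level recurrence in Equation (1), where detach() plays the role of SG(·); the tensor shapes are toy values and are not specified by the disclosure:

```python
import torch

def extend_with_memory(h_prev_segment, h_curr_segment):
    """Concatenate the cached (n-1)-th layer hidden states of the former segment, with
    stop-gradient, and those of the current segment along the length dimension."""
    return torch.cat([h_prev_segment.detach(), h_curr_segment], dim=0)

L, d = 4, 8                                     # toy segment length and hidden dimension
h_prev = torch.randn(L, d)                      # h_{tau-1}^{n-1}, kept as memory
h_curr = torch.randn(L, d)                      # h_{tau}^{n-1}
h_tilde = extend_with_memory(h_prev, h_curr)    # shape (2L, d), used to form keys and values
```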
  • In a standard transformer with absolute position encoding, a query vector $q_i$ is formed from $(E_{x_i} + U_i)$, the sum of an embedding vector $E_{x_i}$ and the i-th absolute position encoding $U_i$, and a key vector $k_j$ is formed from $(E_{x_j} + U_j)$, the sum of an embedding vector $E_{x_j}$ and the j-th absolute position encoding $U_j$. The attention score between the query vector and the key vector in the same segment may be decomposed as:

    $A_{i,j}^{\mathrm{abs}} = E_{x_i}^{\top} W_q^{\top} W_k E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_k U_j + U_i^{\top} W_q^{\top} W_k E_{x_j} + U_i^{\top} W_q^{\top} W_k U_j$    Equation (4)

wherein "abs" is an abbreviation for absolute position encoding.
  • a relative distance R i-j is introduced into the Transformer-XL to describe the relative position embedding between q i and k j .
  • R is a sinusoid encoding matrix without learnable parameters.
  • Correspondingly, the relative attention score may be calculated as:

    $A_{i,j}^{\mathrm{rel}} = E_{x_i}^{\top} W_q^{\top} W_{k,E} E_{x_j} + E_{x_i}^{\top} W_q^{\top} W_{k,R} R_{i-j} + u^{\top} W_{k,E} E_{x_j} + v^{\top} W_{k,R} R_{i-j}$    Equation (5)

wherein "rel" is an abbreviation for relative position encoding, the key weight matrix is separated into a content-based $W_{k,E}$ and a position-based $W_{k,R}$, and two trainable vectors $u, v \in \mathbb{R}^{d}$ are used for replacing $U_i^{\top} W_q^{\top}$ and are multiplied with $W_{k,E} E_{x_j}$ and $W_{k,R} R_{i-j}$, respectively.
  • The Transformer-Layer(·) in Equation (3), by adopting the relative position encoding mechanism, may be calculated as:

    $A_{\tau,i,j}^{n} = q_{\tau,i}^{n\top} k_{\tau,j}^{n} + q_{\tau,i}^{n\top} W_{k,R}^{n} R_{i-j} + u^{\top} k_{\tau,j}^{n} + v^{\top} W_{k,R}^{n} R_{i-j}$    Equation (6)

    $a_{\tau}^{n} = \text{Masked-Softmax}(A_{\tau}^{n})\, v_{\tau}^{n}$    Equation (7)

    $o_{\tau}^{n} = \text{LayerNorm}(\text{Linear}(a_{\tau}^{n}) + h_{\tau}^{n-1})$    Equation (8)

    $f_{\tau}^{n} = \text{Positionwise-Feed-Forward}(o_{\tau}^{n})$    Equation (9)

    $h_{\tau}^{n} = \text{LayerNorm}(f_{\tau}^{n} + o_{\tau}^{n})$    Equation (10)

wherein Masked-Softmax(·) denotes performing masking and Softmax processing, Linear(·) denotes performing a linear transformation, LayerNorm(·) denotes the processing of a regularization (layer normalization) layer, and Positionwise-Feed-Forward(·) denotes performing position-wise feed-forward processing.
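  • The following NumPy sketch spells out the four terms of the relative attention score in Equation (6) with an explicit double loop for readability; production implementations of Transformer-XL vectorize this with a relative-shift trick, and the shape conventions here (queries aligned to the end of the memory-extended key sequence, causal masking) are assumptions made for illustration:

```python
import numpy as np

def relative_attention_scores(q, k, R, W_kR, u, v):
    """Relative attention scores per Equation (6), written out term by term.

    q: (Lq, d) queries of the current segment
    k: (Lk, d) keys over memory + current segment (Lk >= Lq)
    R: (Lk, d) sinusoid encodings indexed by relative distance i - j
    W_kR: (d, d) projection for position-based keys; u, v: (d,) trainable global biases
    """
    Lq, Lk = q.shape[0], k.shape[0]
    A = np.full((Lq, Lk), -np.inf)         # -inf marks positions removed by the mask
    offset = Lk - Lq                       # queries sit at the end of the key sequence
    for i in range(Lq):
        for j in range(Lk):
            if j > i + offset:             # causal mask: a note cannot attend to future notes
                continue
            r = W_kR @ R[i + offset - j]   # position-based key for relative distance i - j
            A[i, j] = q[i] @ k[j] + q[i] @ r + u @ k[j] + v @ r
    return A

# Toy usage: 3 query notes attending over 6 keys (3 memory + 3 current).
d, Lq, Lk = 8, 3, 6
rng = np.random.default_rng(0)
A = relative_attention_scores(rng.standard_normal((Lq, d)), rng.standard_normal((Lk, d)),
                              rng.standard_normal((Lk, d)), rng.standard_normal((d, d)),
                              rng.standard_normal(d), rng.standard_normal(d))
```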
  • FIG.3 illustrates an exemplary architecture 300 of a Transformer-XL.
  • An input 302 may be passed to an embedding layer 310 to obtain an embedding representation.
  • At 320, relative position encoding may be performed.
  • An output of the addition at 322 may be provided to a memory sensitive module 330 of the Transformer-XL.
  • the module 330 may be repeated N times.
  • Each module 330 may further include: a module 332 of masked relative multi-head attention with memory, which may be based on, e.g., the processing of Equations (1), (2), (5), (6), (7), etc.; an addition and regularization layer 334, which may be based on, e.g., the processing of Equation (8); a feed forward layer 336, which may be based on, e.g., the processing of Equation (9); and an addition and regularization layer 338, which may be based on, e.g., the processing of Equation (10).
  • An output of the N modules 330 may be provided to a linear layer 340 to perform linear mapping.
  • An output of the linear layer 340 may be provided to a Softmax layer 350 to obtain a probability 304 of a predicted sequence.
  • the transformer network according to the embodiments of the present disclosure may be constructed based at least on, e.g., the Transformer-XL shown in FIG.3.
  • FIG.4 illustrates an exemplary architecture 400 of a transformer network according to an embodiment.
  • the transformer network may include four transformer sub-networks constructed based on Transformer-XL respectively.
  • the four transformer sub-networks may process an On2On sequence 420, an On2Off sequence 430, a pitch sequence 440, and a velocity sequence 450 included in a time-valued note sequence 410 respectively.
  • an embedding representation of the On2On sequence may be obtained through an embedding layer 421 first.
  • the embedding representation of the On2On sequence may be further passed through a relative position encoding at 424, N memory sensitive modules 425, a linear layer 426, a Softmax layer 427, etc.
  • the relative position encoding at 424 may correspond to the relative position encoding at 320 in FIG.3
  • the memory sensitive module 425 may correspond to the module 330 in FIG.3
  • the linear layer 426 may correspond to the linear layer 340 in FIG.3
  • the Softmax layer 427 may correspond to the Softmax layer 350 in FIG.3.
  • the Softmax layer 427 may output probabilities of On2On candidate time values for the next note.
  • An On2On candidate time value with the highest probability may be selected as an On2On time value of the next note.
  • an On2On time value of the next note may also be randomly selected from a plurality of On2On candidate time values with highest-ranked probabilities.
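  • A minimal sketch of these two selection strategies over the Softmax output (greedy when k=1, otherwise a random pick among the k highest-ranked candidates); the function name and the choice of k are assumptions:

```python
import numpy as np

def sample_top_k(probs, k=5, rng=None):
    """Select a candidate index: greedily if k=1, otherwise randomly among the k
    highest-ranked probabilities, renormalized."""
    rng = rng or np.random.default_rng()
    top = np.argsort(probs)[::-1][:k]        # indices of the k most probable candidates
    p = probs[top] / probs[top].sum()
    return int(rng.choice(top, p=p))

# e.g., probabilities over On2On candidate time values output by the Softmax layer 427
on2on_probs = np.array([0.05, 0.40, 0.30, 0.15, 0.10])
next_on2on_index = sample_top_k(on2on_probs, k=3)
```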
  • an embedding representation of the On2Off sequence may be obtained through an embedding layer 431 first.
  • the embedding representation of the On2Off sequence may be further passed through a relative position encoding at 434, N memory sensitive modules 435, a linear layer 436, a Softmax layer 437, etc.
  • the relative position encoding at 434 may correspond to the relative position encoding at 320 in FIG.3
  • the memory sensitive module 435 may correspond to the module 330 in FIG.3
  • the linear layer 436 may correspond to the linear layer 340 in FIG.3
  • the Softmax layer 437 may correspond to the Softmax layer 350 in FIG.3.
  • the Softmax layer 437 may output probabilities of On2Off candidate time values for the next note.
  • An On2Off candidate time value with the highest probability may be selected as an On2Off time value of the next note.
  • an On2Off time value of the next note may also be randomly selected from a plurality of On2Off candidate time values with highest-ranked probabilities.
  • an embedding representation of the pitch sequence may be obtained through an embedding layer 441 first.
  • At least one of the On2On sequence and the On2Off sequence may also be input to the transformer sub-network corresponding to the pitch sequence 440, so that a pitch of the next note may be predicted under the impact of the at least one of the On2On sequence and the On2Off sequence.
  • the embedding representation of the pitch sequence may be concatenated with an impact factor 414, wherein the impact factor 414 may include the embedding representation of the On2On sequence obtained through the embedding layer 421 and/or the embedding representation of the On2Off sequence obtained through the embedding layer 431.
  • the relative position encoding at 444 may correspond to the relative position encoding at 320 in FIG.3, the memory sensitive module 445 may correspond to the module 330 in FIG.3, the linear layer 446 may correspond to the linear layer 340 in FIG.3, and the Softmax layer 447 may correspond to the Softmax layer 350 in FIG.3.
  • the Softmax layer 447 may output probabilities of candidate pitches for the next note. A candidate pitch with the highest probability may be selected as a pitch of the next note. Optionally, a pitch of the next note may also be randomly selected from a plurality of candidate pitches with highest-ranked probabilities.
  • In a transformer sub-network corresponding to the velocity sequence 450, an embedding representation of the velocity sequence may be obtained through an embedding layer 451 first.
  • At least one of the On2On sequence and the On2Off sequence may also be input to the transformer sub-network corresponding to the velocity sequence 450, so that a velocity of the next note may be predicted under the impact of the at least one of the On2On sequence and the On2Off sequence.
  • the pitch sequence may also be input to the transformer sub-network corresponding to the velocity sequence 450, so that a velocity of the next note may be predicted further under the impact of the pitch sequence.
  • the embedding representation of the velocity sequence may be concatenated with an impact factor 415, wherein the impact factor 415 may include the embedding representation of the On2On sequence obtained through the embedding layer 421 and/or the embedding representation of the On2Off sequence obtained through the embedding layer 431. Optionally, the impact factor 415 may also include the embedding representation of the pitch sequence obtained through the embedding layer 441. It should be understood that the combination of multiple embedding representations may also be performed at 452 in any other approaches.
  • dimension transformation may be performed on an output of the concatenation at 452 through a linear layer 453.
  • An output of the linear layer 453 may be further passed through a relative position encoding at 454, N memory sensitive modules 455, a linear layer 456, a Softmax layer 457, etc.
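  • A PyTorch sketch of the concatenation at 452 and the dimension transformation at 453 for the velocity sub-network; the embedding dimension, vocabulary sizes, and variable names are assumptions rather than values from the disclosure:

```python
import torch
import torch.nn as nn

d_model = 256   # assumed embedding dimension; not specified in the disclosure

# Embedding layers 421, 431, 441, 451 (vocabulary sizes are assumptions)
on2on_emb  = nn.Embedding(128, d_model)
on2off_emb = nn.Embedding(128, d_model)
pitch_emb  = nn.Embedding(128, d_model)
vel_emb    = nn.Embedding(128, d_model)

# Linear layer 453: project the concatenation back to d_model for the velocity sub-network
proj_453 = nn.Linear(4 * d_model, d_model)

def velocity_subnetwork_input(on2on_ids, on2off_ids, pitch_ids, vel_ids):
    """Concatenation at 452 followed by the dimension transformation at 453."""
    x = torch.cat([vel_emb(vel_ids), on2on_emb(on2on_ids),
                   on2off_emb(on2off_ids), pitch_emb(pitch_ids)], dim=-1)
    return proj_453(x)     # then fed to the relative position encoding 454 and the N modules 455

ids = torch.tensor([[1, 2, 3, 4]])                  # toy token ids of length 4
h0 = velocity_subnetwork_input(ids, ids, ids, ids)  # shape (1, 4, d_model)
```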
  • the relative position encoding at 454 may correspond to the relative position encoding at 320 in FIG.3
  • the memory sensitive module 455 may correspond to the module 330 in FIG.3
  • the linear layer 456 may correspond to the linear layer 340 in FIG.3
  • the Softmax layer 457 may correspond to the Softmax layer 350 in FIG.3.
  • the Softmax layer 457 may output probabilities of candidate velocities for the next note. A candidate velocity with the highest probability may be selected as a velocity of the next note.
  • a velocity of the next note may also be randomly selected from a plurality of candidate velocities with highest-ranked probabilities.
  • the On2On time value, On2Off time value, pitch, and velocity determined through the four transformer sub-networks may be combined to form a 4-tuple for representing the predicted next note 460.
  • the predicted note 460 may be added to the time-valued note sequence 410 to form an updated time-valued note sequence.
  • the updated time-valued note sequence may be used for predicting the next note through the architecture 400 again. Through iteratively performing predictions based on the architecture 400, a time-valued note sequence corresponding to the music to be generated may be finally obtained.
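  • A schematic sketch of this iterative prediction loop; the sub-network interface (plain callables taking the current sequence and any impact factors) is an assumption made to keep the example self-contained:

```python
def generate(initial_sequence, subnetworks, num_notes=512):
    """Iteratively predict notes and append them to the time-valued note sequence.

    subnetworks: dict with callables 'on2on', 'on2off', 'pitch', 'velocity', each mapping
    the current sequence (and any impact factors) to a predicted value for the next note.
    """
    sequence = list(initial_sequence)
    for _ in range(num_notes):
        on2on  = subnetworks['on2on'](sequence)
        on2off = subnetworks['on2off'](sequence)
        pitch  = subnetworks['pitch'](sequence, on2on=on2on, on2off=on2off)
        vel    = subnetworks['velocity'](sequence, on2on=on2on, on2off=on2off, pitch=pitch)
        sequence.append((on2on, on2off, pitch, vel))     # the predicted note 460
    return sequence
```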
  • At least one of the On2On sequence and the On2Off sequence is input to at least one of the transformer sub-network for processing the pitch sequence and the transformer sub-network for processing the velocity sequence, so that predictions of pitch and/or velocity may also consider at least time value information.
  • cross-entropy loss of each transformer sub-network may be calculated, and respective cross-entropy losses of the four transformer sub-networks may be combined into a global loss for performing target loss optimization. Therefore, the four transformer sub-networks not only maintain relative independence, but also have mutual influences.
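  • A minimal sketch of combining the four cross-entropy losses into a global loss; equal weighting is an assumption, as the disclosure does not specify how the losses are combined:

```python
import torch.nn.functional as F

def global_loss(logits, targets, weights=(1.0, 1.0, 1.0, 1.0)):
    """Sum the cross-entropy losses of the On2On, On2Off, pitch, and velocity sub-networks."""
    keys = ('on2on', 'on2off', 'pitch', 'velocity')
    losses = [F.cross_entropy(logits[k], targets[k]) for k in keys]
    return sum(w * l for w, l in zip(weights, losses))
```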
  • FIG.5 illustrates an exemplary architecture 500 of a transformer network according to an embodiment.
  • the architecture 500 is a variant of the architecture 400 in FIG.4.
  • the same reference numbers correspond to the same components.
  • factors of music style and/or emotion are further considered in the process of predicting a time-valued note sequence based on the transformer network so that the predicted time-valued note sequence may follow a specific music style and/or emotion.
  • a designated music style 502 may be obtained, and a style embedding representation corresponding to the music style 502 may be obtained through an embedding layer 504.
  • a designated emotion 506 may be obtained, and an emotion embedding representation corresponding to the emotion 506 may be obtained through an embedding layer 508. Then, the style embedding representation and/or the emotion embedding representation may be provided to at least one of the four transformer sub-networks.
  • the embedding representation of the On2On sequence may be concatenated with an impact factor 512 at 522, wherein the impact factor 512 may include at least one of the style embedding representation and the emotion embedding representation. Then, dimension transformation may be performed on an output of the concatenation at 522 through an optional linear layer 523, and an output of the linear layer 523 may be provided to subsequent processing.
  • the embedding representation of the On2Off sequence may be concatenated with an impact factor 513 at 532, wherein the impact factor 513 may include at least one of the style embedding representation and the emotion embedding representation.
  • an impact factor 514 may also include at least one of the style embedding representation and the emotion embedding representation.
  • an impact factor 515 may also include at least one of the style embedding representation and the emotion embedding representation.
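  • A PyTorch sketch of how the style and emotion embedding representations may be concatenated with a sequence embedding as an impact factor, here for the On2On sub-network (i.e., the concatenation at 522 and the optional linear layer 523); the dimensions and label vocabularies are assumptions:

```python
import torch
import torch.nn as nn

d_model = 256                         # assumed dimensions
num_styles, num_emotions = 8, 4       # assumed label vocabularies

style_emb_504   = nn.Embedding(num_styles, d_model)    # embedding layer 504
emotion_emb_508 = nn.Embedding(num_emotions, d_model)  # embedding layer 508
proj_523 = nn.Linear(3 * d_model, d_model)             # optional linear layer 523

def conditioned_on2on_input(on2on_embedded, style_id, emotion_id):
    """Concatenate the On2On embedding with the style/emotion impact factor 512 (at 522)."""
    L = on2on_embedded.shape[1]
    style   = style_emb_504(style_id).unsqueeze(1).expand(-1, L, -1)
    emotion = emotion_emb_508(emotion_id).unsqueeze(1).expand(-1, L, -1)
    return proj_523(torch.cat([on2on_embedded, style, emotion], dim=-1))

x = torch.randn(1, 4, d_model)                                           # toy embedded On2On sequence
h0 = conditioned_on2on_input(x, torch.tensor([2]), torch.tensor([1]))    # shape (1, 4, d_model)
```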
  • the above architecture 500 is only exemplary, and depending on specific application requirements and designs, any form of modification may be made to the architecture 500, and all these modifications will be covered by the present disclosure.
  • Exemplary architectures of the transformer network according to the embodiments of the present disclosure are discussed in conjunction with FIG.4 and FIG.5.
  • the final time-valued note sequence output by the transformer network may be further converted into music content through, e.g., conversion operation performed by the converting module 130 in FIG.1.
  • the embodiments of the present disclosure may support updating the generated music content. For example, since it is easy to recognize and understand the music content generated according to the embodiments of the present disclosure, the generated music content may be easily edited or modified.
  • FIG.6 illustrates an exemplary process 600 of updating music content according to an embodiment.
  • music content 604 is created through performing automatic music generation at 610.
  • the automatic music generation at 610 may be performed according to the embodiments of the present disclosure as described above.
  • the music content 604 may be provided to a music content editing platform 620.
  • the music content editing platform 620 may be application software, a website, etc. that supports operations of presenting, modifying, etc., the music content 604. Assuming that the music content 604 is a MIDI file, the music content editing platform 620 may be, e.g., application software that can create and edit MIDI files.
  • the music content editing platform 620 may include a user interface for interacting with a user. Through the user interface, the music content 604 may be provided and presented to the user, and an adjustment indication 606 from the user for at least a part of the music content may be received. For example, if the user is not satisfied with a part of the music content 604 or desires to modify one or more notes, the user may input the adjustment indication 606 through the user interface.
  • the adjustment indication 606 may comprise modifications or settings for various parameters involved in music, wherein the parameters may comprise, e.g., tempo, pitch, velocity, etc. The following takes a MIDI file as an example.
  • the existing Music Transformer or Transformer-XL that is based on a single note sequence represents notes based on time or time intervals; thus, in a generated MIDI file, notes cannot be effectively quantized to grids or aligned with grids, and therefore it is difficult for a user to recognize specific notes in the MIDI file, to set MIDI controllers, and to arrange music.
  • since the embodiments of the present disclosure represent notes based on time values, notes in a generated MIDI file will be properly quantized to grids, and thus a user may easily recognize the notes and modify corresponding parameters.
  • when the adjustment indication 606 for at least one note is received through the music content editing platform 620, the note may be adjusted in response to the adjustment indication 606, so as to obtain an adjusted note 632.
  • the adjusted note 632 may be used to replace the original note in the music content 604, so as to form updated music content 608.
  • the adjustment operation that is based on an adjustment indication may be performed iteratively, thereby implementing continuous updating of the generated music content.
  • FIG.7 illustrates a flow of an exemplary method 700 for automatic music generation according to an embodiment.
  • an initial sequence may be obtained.
  • a time-valued note sequence may be generated through a transformer network.
  • the time-valued note sequence may be converted to music content.
  • the initial sequence may be randomly generated, received, or generated according to a music segment.
  • the generating a time-valued note sequence may comprise iteratively performing the operation of: predicting the next note based at least on the current time-valued note sequence through the transformer network.
  • the transformer network may be constructed based at least on Transformer-XL.
  • each note in the time-valued note sequence may be represented by a 4-tuple. The 4-tuple may comprise: a time value from the former note on to the current note on; a time value from the current note on to the current note off; a pitch of the current note; and a velocity of the current note.
  • the time-valued note sequence may comprise: a first sequence corresponding to the time value from the former note on to the current note on; a second sequence corresponding to the time value from the current note on to the current note off; a third sequence corresponding to the pitch of the current note; and a fourth sequence corresponding to the velocity of the current note.
  • the transformer network may comprise: a first transformer sub-network corresponding to the first sequence; a second transformer sub-network corresponding to the second sequence; a third transformer sub-network corresponding to the third sequence; and a fourth transformer sub-network corresponding to the fourth sequence.
  • At least one of the first sequence and the second sequence may be further input to at least one of the third transformer sub-network and the fourth transformer sub-network.
  • the third sequence may be further input to the fourth transformer sub-network.
  • the method 700 may further comprise: receiving an indication of emotion and/or an indication of music style; generating an emotion embedding representation corresponding to the emotion and a style embedding representation corresponding to the music style; and inputting the emotion embedding representation and/or the style embedding representation to at least one of the first transformer sub-network, the second transformer sub-network, the third transformer sub-network and the fourth transformer sub-network.
  • the method 700 may further comprise: receiving an indication of music parameters.
  • the converting the time-valued note sequence to music content may be performed further based on the music parameters.
  • the music parameters may comprise at least one of: tempo, metre, and length.
  • the method 700 may further comprise: receiving an adjustment indication for at least one note in the music content; and in response to the adjustment indication, updating the music content.
  • the music content may be a MIDI file.
  • the method 700 may further comprise any step/process for automatic music generation according to the embodiments of the present disclosure as described above.
  • FIG.8 illustrates an exemplary apparatus 800 for automatic music generation according to an embodiment.
  • the apparatus 800 may comprise: an initial sequence obtaining module 810, for obtaining an initial sequence; a time-valued note sequence generating module 820 for, in response to the initial sequence, generating a time-valued note sequence through a transformer network; and a converting module 830, for converting the time-valued note sequence to music content.
  • the generating a time-valued note sequence may comprise iteratively performing the operation of: predicting the next note based at least on the current time-valued note sequence through the transformer network
  • each note in the time-valued note sequence may be represented by a 4-tuple.
  • the 4-tuple may comprise: a time value from the former note on to the current note on; a time value from the current note on to the current note off; a pitch of the current note; and a velocity of the current note.
  • the time-valued note sequence may comprise: a first sequence corresponding to the time value from the former note on to the current note on; a second sequence corresponding to the time value from the current note on to the current note off; a third sequence corresponding to the pitch of the current note; and a fourth sequence corresponding to the velocity of the current note.
  • the transformer network may comprise: a first transformer sub-network corresponding to the first sequence; a second transformer sub-network corresponding to the second sequence; a third transformer sub-network corresponding to the third sequence; and a fourth transformer sub-network corresponding to the fourth sequence.
  • the apparatus 800 may further comprise any other module that performs steps of the methods for automatic music generation according to the embodiments of the present disclosure as described above.
  • FIG.9 illustrates an exemplary apparatus 900 for automatic music generation according to an embodiment.
  • the apparatus 900 may comprise: at least one processor 910; and a memory 920 storing computer-executable instructions that, when executed, cause the at least one processor 910 to: obtain an initial sequence; in response to the initial sequence, generate a time-valued note sequence through a transformer network; and convert the time-valued note sequence to music content.
  • the processor 910 may further perform any other step/process of the methods for automatic music generation according to the embodiments of the present disclosure as described above.
  • the embodiments of the present disclosure may be embodied in a non- transitory computer-readable medium.
  • the non-transitory computer-readable medium may comprise instructions that, when executed, cause one or more processors to perform any operations of the methods for automatic music generation according to the embodiments of the present disclosure as described above.
  • all the operations in the methods described above are merely exemplary, and the present disclosure is not limited to any operations in the methods or to the sequence orders of these operations, and should cover all other equivalents under the same or similar concepts.
  • all the modules in the apparatuses described above may be implemented in various approaches. These modules may be implemented as hardware, software, or a combination thereof. Moreover, any of these modules may be further functionally divided into sub-modules or combined together. Processors have been described in connection with various apparatuses and methods.
  • processors may be implemented using electronic hardware, computer software, or any combination thereof. Whether such processors are implemented as hardware or software will depend upon the particular application and overall design constraints imposed on the system.
  • a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with a microprocessor, microcontroller, digital signal processor (DSP), a field-programmable gate array (FPGA), a programmable logic device (PLD), a state machine, gated logic, discrete hardware circuits, and other suitable processing components configured to perform the various functions described throughout the present disclosure.
  • Alternatively, a processor, any portion of a processor, or any combination of processors presented in the present disclosure may be implemented with software being executed by a microprocessor, microcontroller, DSP, or other suitable platform.
  • Software shall be construed broadly to mean instructions, instruction sets, code, code segments, program code, programs, subprograms, software modules, applications, software applications, software packages, routines, subroutines, objects, threads of execution, procedures, functions, etc.
  • the software may reside on a computer-readable medium.
  • a computer-readable medium may include, by way of example, memory such as a magnetic storage device (e.g., hard disk, floppy disk, magnetic strip), an optical disk, a smart card, a flash memory device, random access memory (RAM), read only memory (ROM), programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), a register, or a removable disk.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)
PCT/US2021/022025 2020-05-18 2021-03-12 Automatic music generation WO2021236209A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010421429.3A CN113689835A (zh) 2020-05-18 2020-05-18 Automatic music generation
CN202010421429.3 2020-05-18

Publications (1)

Publication Number Publication Date
WO2021236209A1 true WO2021236209A1 (en) 2021-11-25

Family

ID=75439465

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2021/022025 WO2021236209A1 (en) 2020-05-18 2021-03-12 Automatic music generation

Country Status (2)

Country Link
CN (1) CN113689835A (zh)
WO (1) WO2021236209A1 (zh)

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2080604B (en) * 1980-07-19 1984-03-07 Rose Alfred George Musical rhythm devices
JPH05341787A (ja) * 1992-06-11 1993-12-24 Roland Corp Automatic accompaniment device
JP3192597B2 (ja) * 1996-12-18 2001-07-30 株式会社河合楽器製作所 Automatic performance device for electronic musical instrument
JP3587133B2 (ja) * 2000-06-05 2004-11-10 ヤマハ株式会社 Sounding length determination method and device, and recording medium
JP6175812B2 (ja) * 2013-03-06 2017-08-09 ヤマハ株式会社 Musical tone information processing device and program
CN106875929B (zh) * 2015-12-14 2021-01-19 中国科学院深圳先进技术研究院 Music melody conversion method and system
CN109448697B (zh) * 2018-10-08 2023-06-02 平安科技(深圳)有限公司 Poetry melody generation method, electronic device and computer-readable storage medium
CN109545172B (zh) * 2018-12-11 2023-01-24 河南师范大学 Separated note generation method and device
CN109584846B (zh) * 2018-12-21 2023-04-14 成都潜在人工智能科技有限公司 Melody generation method based on a generative adversarial network
CN109727590B (zh) * 2018-12-24 2020-09-22 成都嗨翻屋科技有限公司 Music generation method and device based on a recurrent neural network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ASHISH VASWANI ET AL: "Attention Is All You Need", 31ST CONFERENCE ON NEURAL INFORMATION PROCESSING SYSTEMS (NIPS 2017), 12 June 2017 (2017-06-12), XP055506908, Retrieved from the Internet <URL:https://arxiv.org/pdf/1706.03762.pdf> [retrieved on 20180913] *
JEAN-PIERRE BRIOT ET AL: "Deep Learning Techniques for Music Generation -- A Survey", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 6 September 2017 (2017-09-06), XP081457762 *
YU-SIANG HUANG ET AL: "Pop Music Transformer: Generating Music with Rhythm and Harmony", ARXIV.ORG, CORNELL UNIVERSITY LIBRARY, 201 OLIN LIBRARY CORNELL UNIVERSITY ITHACA, NY 14853, 1 February 2020 (2020-02-01), pages 1 - 7, XP081591444 *

Also Published As

Publication number Publication date
CN113689835A (zh) 2021-11-23

Similar Documents

Publication Publication Date Title
Donahue et al. LakhNES: Improving multi-instrumental music generation with cross-domain pre-training
Yang et al. MidiNet: A convolutional generative adversarial network for symbolic-domain music generation
Zhang Learning adversarial transformer for symbolic music generation
Chen et al. The effect of explicit structure encoding of deep neural networks for symbolic music generation
Zhao et al. An emotional symbolic music generation system based on LSTM networks
CN109448683A (zh) Music generation method and device based on a neural network
JP6565530B2 (ja) Automatic accompaniment data generation device and program
US20150255052A1 (en) Generative scheduling method
Meade et al. Exploring conditioning for generative music systems with human-interpretable controls
Castro Performing structured improvisations with pre-trained deep learning models
Siphocly et al. Applications of computational intelligence in computer music composition
Kumar et al. Automatic Music Generation System based on RNN Architecture
Janssen et al. Algorithmic Ability to Predict the Musical Future: Datasets and Evaluation.
Maduskar et al. Music generation using deep generative modelling
WO2021236209A1 (en) Automatic music generation
Maezawa et al. Bayesian audio-to-score alignment based on joint inference of timbre, volume, tempo, and note onset timings
Haki et al. Real-time drum accompaniment using transformer architecture
Lousseief et al. Mahlernet: Unbounded orchestral music with neural networks
Turker et al. MIDISpace: Finding Linear Directions in Latent Space for Music Generation
Rawat et al. Automatic Music Generation: Comparing LSTM and GRU
Rajadhyaksha et al. Music generation with bi-directional long short term memory neural networks
Subramanian et al. Deep Learning Approaches for Melody Generation: An Evaluation Using LSTM, BILSTM and GRU Models
Asesh Markov chain sequence modeling
Assayag et al. Cocreative Interaction: Somax2 and the REACH Project
Richter Style-Specific Beat Tracking with Deep Neural Networks

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21717618

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21717618

Country of ref document: EP

Kind code of ref document: A1