CN116704980A - Musical composition generation method, music generation model training method and equipment thereof

Info

Publication number: CN116704980A
Application number: CN202310970525.7A
Authority: CN (China)
Prior art keywords: symbol information, determined, phrase, music, information
Legal status: Granted; currently active
Other languages: Chinese (zh)
Other versions: CN116704980B (en)
Inventor: Shan Yong (单勇)
Original and current assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd with priority to CN202310970525.7A; published as CN116704980A and, upon grant, as CN116704980B.

Classifications

    • G10H1/0025 - Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece (under G10H1/00 Details of electrophonic musical instruments, G10H1/0008 Associated control or indicating means)
    • G10H2210/031 - Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G10H2210/111 - Automatic composing, i.e. using predefined musical rules (under G10H2210/101 Music composition or musical creation; tools or processes therefor)
    • G10H2210/145 - Composing rules, e.g. harmonic or musical rules, for use in automatic composition; rule generation algorithms therefor
    • G10H2250/311 - Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

    All of the above fall under G (Physics) > G10 (Musical instruments; acoustics) > G10H (Electrophonic musical instruments; instruments in which the tones are generated by electromechanical means or electronic generators, or in which the tones are synthesised from a data store).

Abstract

The embodiments of the present application provide a musical composition generation method, a training method of a music generation model, and equipment thereof, relating to the processing of multimedia content. The method includes: inputting start symbol information of a musical composition into a music generation model and sequentially determining the symbol information of the composition; for each piece of symbol information to be determined, calculating its attention information with a structured attention method according to its position information in the composition and the symbol information and phrases already determined, and determining the symbol information from that attention information. The attention information of the symbol information to be determined is computed from the vectors of all non-phrase-ending symbol information of the strongly related phrases of the current phrase, the vectors of the phrase-ending symbol information of the weakly related phrases, and the vectors of the symbol information already determined in the current phrase. Fine-grained music information of strongly related phrases can thus be attended to, which improves the similarity of repeated music and generates musical compositions with good repeatability.

Description

Musical composition generation method, music generation model training method and equipment thereof
Technical Field
The embodiments of the present application relate to the technical field of multimedia content processing, and in particular to a musical composition generation method, a training method of a music generation model, and equipment thereof.
Background
Music is a complex form of artistic expression. Musical creation requires not only knowledge of music theory but also good perception and appreciation of art, and is therefore very difficult. In general, the cycle of manual music creation is long and its time cost is high, making it hard to meet people's growing demand for music. How to automatically generate musical compositions that meet users' needs efficiently and quickly is a difficult problem facing the field of music creation.
With the development of artificial intelligence (AI) and neural network models, neural network models have gradually been applied to automatic music generation. However, musical compositions generated by existing neural network models generally lack a clear and complete phrase structure and do not follow human composing patterns; the repeated phrases of the generated music have low similarity and do not match the characteristics of human-composed musical works.
Disclosure of Invention
The embodiments of the present application provide a musical composition generation method, a training method of a music generation model, and equipment thereof, which can generate musical compositions with better repeatability.
In a first aspect, an embodiment of the present application provides a method for generating a musical composition, the method including: generating start symbol information of the musical composition according to a generation instruction for the musical composition; inputting the start symbol information into a music generation model and sequentially determining a plurality of pieces of symbol information of the musical composition, each piece of symbol information including a plurality of musical attributes, where the music generation model is configured to calculate attention information of the symbol information to be determined with a structured attention method, according to position information of the symbol information to be determined in the musical composition and the determined symbol information and determined phrases located before it, and to determine the symbol information to be determined according to its attention information, the structured attention method including: determining the attention information of the symbol information to be determined according to the vectors of all non-phrase-ending symbol information of the strongly related phrases of the current phrase to which the symbol information to be determined belongs, the vectors of the phrase-ending symbol information of the weakly related phrases of the current phrase, and the vectors of the determined symbol information of the current phrase, where the strongly related and weakly related phrases of the current phrase belong to the determined phrases, and the phrase-ending symbol information is the last symbol information of each phrase and represents the music information of the whole phrase; and generating a score of the musical composition according to the symbol information of the musical composition.
In a second aspect, an embodiment of the present application provides a training method for a music generation model, the method including: acquiring a plurality of training samples, where each training sample is an actual symbol sequence of a musical composition, the musical composition is composed of a plurality of phrases, the actual symbol sequence includes the actual symbol information of each phrase of the musical composition, and the actual symbol information includes a plurality of musical attributes; training the music generation model with the training samples to obtain a predicted symbol sequence for each training sample, the predicted symbol sequence including the predicted symbol information of the musical composition sequentially determined by the music generation model; calculating a loss value for the training sample according to the actual symbol sequence and the predicted symbol sequence of the training sample; calculating a loss value for the music generation model using the loss values of the plurality of training samples; and updating the parameters of the music generation model according to the loss value of the music generation model.
In a third aspect, an embodiment of the present application provides a musical piece generating apparatus, including:
a start module, configured to generate start symbol information of the musical composition according to a generation instruction for the musical composition; a determination module, configured to input the start symbol information into a music generation model and sequentially determine a plurality of pieces of symbol information of the musical composition, each piece of symbol information including a plurality of musical attributes, where the music generation model is configured to calculate attention information of the symbol information to be determined with a structured attention method, according to position information of the symbol information to be determined in the musical composition and the determined symbol information and determined phrases located before it, and to determine the symbol information to be determined according to its attention information, the structured attention method including: determining the attention information of the symbol information to be determined according to the vectors of all non-phrase-ending symbol information of the strongly related phrases of the current phrase to which the symbol information to be determined belongs, the vectors of the phrase-ending symbol information of the weakly related phrases of the current phrase, and the vectors of the determined symbol information of the current phrase, where the strongly related and weakly related phrases of the current phrase belong to the determined phrases, and the phrase-ending symbol information is the last symbol information of each phrase and represents the music information of the whole phrase; and a generation module, configured to generate a score of the musical composition according to the symbol information of the musical composition.
In some alternative implementations, the music generation model includes: a music representation model, an autoregressive decoding network and an output module;
the music representation model is used for converting current symbol information which is newly generated by the music generation model into a vector of the current symbol information;
the autoregressive decoding network is used for taking the vector of the current symbol information and the position code of the symbol information to be determined as inputs, calculating the attention information of the symbol information to be determined by adopting the structured attention method, determining the decoding vector of the symbol information to be determined according to the attention information of the symbol information to be determined, wherein the symbol information to be determined is the next symbol information of the current symbol information, and the position code is obtained by encoding the position information of the symbol information to be determined in the musical piece;
when the symbol information to be determined is the non-last symbol information of the current phrase, determining attention information of the symbol information to be determined according to vectors of all non-phrase ending symbol information of the strong related phrase, vectors of phrase ending symbol information of the weak related phrase and vectors of the symbol information determined in the current phrase; when the symbol information to be determined is the last symbol information of the current phrase, determining the attention information of the symbol information to be determined according to the vector of the symbol information of the current phrase;
The output module is used for determining the symbol information to be determined according to the decoding vector of the symbol information to be determined, and taking the symbol information to be determined as the input of the music representation model.
In some alternative implementations, the symbol information includes the following musical attributes: symbol information category, phrase identification, number of remaining bars of the phrase, beat identification within the bar, chord, track to which the note belongs, pitch of the note, and duration of the note;
the symbol information category includes any one of the following: the beginning of the music, the end of the music, the beginning of a bar, the starting position of a chord or note, a note, and the end of a phrase, where the phrase-end category is the category of the last symbol information of the phrase.
In some alternative implementations, the music representation model is specifically for:
converting the value of each music attribute of the current symbol information into a vector corresponding to each music attribute;
combining the vectors corresponding to all the musical attributes of the current symbol information to form a first vector;
and mapping the first vector to a linear layer with a preset dimension to obtain the vector of the current symbol information.
In some alternative implementations, the music representation model is specifically for:
the value of each music attribute of the current symbol information is processed through word embedding, so that a vector corresponding to each music attribute is obtained;
the first vector is converted into a vector of the current symbol information by a multi-layer perceptron MLP.
In some alternative implementations, the method further includes:
a receiving module, used for receiving generation parameters of the musical composition input by a user, where the generation parameters include at least one of the following: a phrase structure of the musical composition; and chord progression information of the musical composition, the chord progression information including the positions and names of a plurality of chords;
the output module is specifically configured to: determining a predicted value of the symbol information to be determined according to the decoding vector of the symbol information to be determined; and modifying the predicted value of the symbol information to be determined according to the generation parameter to obtain the symbol information to be determined.
In some optional implementations, when the phrase structure of the musical piece is included in the generation parameters, the output module is specifically configured to:
determining that the symbol information to be determined is the first symbol information of a new phrase, and establishing the new phrase;
determining the identification of the new phrase and the number of bars of the phrase according to the phrase structure;
and modifying the phrase identification of the symbol information to be determined and the predicted value of the number of remaining bars of the phrase according to the identification of the new phrase and the number of bars of the phrase.
In some optional implementations, when the generating parameter further includes chord progression information of the musical piece, the output module is specifically configured to:
determining that the symbol information to be determined is the starting position of a chord according to the position of each chord in the chord progression information and the position of the symbol information to be determined in the current phrase;
and modifying the predicted value of the chord in the predicted symbol information of the symbol information to be determined according to the name of that chord in the chord progression information.
In some alternative implementations, the output module includes a linear layer and a softmax layer;
the linear layer is used for mapping the decoding vector of the symbol information to be determined into output vectors of all music attributes of the symbol information to be determined;
the softmax layer is used for processing the output vector of each music attribute of the symbol information to be determined to obtain probability distribution of each music attribute of the symbol information to be determined, determining predicted values of each music attribute of the symbol information to be determined according to the probability distribution of each music attribute of the symbol information to be determined, and forming the predicted values of the symbol information to be determined by the predicted values of each music attribute of the symbol information to be determined.
In some alternative implementations, the softmax layer is specifically for:
processing the output vector of the symbol information category of the symbol information to be determined to obtain probability distribution of the symbol information category of the symbol information to be determined, and determining a predicted value of the symbol information category of the symbol information to be determined according to the probability distribution of the symbol information category of the symbol information to be determined;
determining a first type of music attribute with a null value and a second type of music attribute with a non-null value in the remaining music attributes of the symbol information to be determined according to the predicted value of the symbol information category of the symbol information to be determined, wherein the first type of music attribute and the second type of music attribute included in different symbol information categories are different;
and processing the output vector of the second type of music attribute of the symbol information to be determined to obtain probability distribution of the second type of music attribute, and determining a predicted value of the second type of music attribute according to the probability distribution of the second type of music attribute.
In some alternative implementations, the music generation model is further to:
judging whether the current symbol information is music ending symbol information or not;
If the current symbol information is the music ending symbol information, determining to end the generation of the music work;
and if the current symbol information is not the music ending symbol information, taking the current symbol information as the input of the music representation model.
In some alternative implementations, the method further includes:
and the ending module is used for generating the music ending symbol information according to the phrase structure of the music work or the maximum symbol information quantity of the music work, and taking the music ending symbol information as the current symbol information.
In some alternative implementations, the autoregressive decoding network includes L decoders, L being greater than or equal to 1;
the input of the first decoder is the vector of the current symbol information and the position code of the symbol information to be determined;
the input of the remaining decoders is the output of the adjacent previous decoder, and the output of the last decoder is the decoding vector of the symbol information to be determined;
and each decoder calculates the attention information of the symbol information to be determined by adopting the structured attention method.
In some alternative implementations, the decoder includes a structured attention module and a feed forward network;
The structured attention module is used for calculating the attention information of the symbol information to be determined by the structured attention method;
and the feedforward network is used for processing the attention information of the symbol information to be determined to obtain a decoding vector of the symbol information to be determined.
In some alternative implementations, the autoregressive decoding network is a recurrent neural network (RNN) model, a convolutional neural network (CNN) model, or a Transformer model.
In a fourth aspect, an embodiment of the present application provides a training apparatus for a music generation model, the apparatus including: an acquisition module, used for acquiring a plurality of training samples, where each training sample is an actual symbol sequence of a musical composition, the musical composition is composed of a plurality of phrases, the actual symbol sequence includes the actual symbol information of each phrase of the musical composition, and the actual symbol information includes a plurality of musical attributes; a training module, used for training the music generation model with the training samples to obtain a predicted symbol sequence for each training sample, the predicted symbol sequence including the predicted symbol information of the musical composition sequentially determined by the music generation model; a loss calculation module, used for calculating a loss value for the training sample according to the actual symbol sequence and the predicted symbol sequence of the training sample, and further used for calculating a loss value for the music generation model using the loss values of the plurality of training samples; and a parameter updating module, used for updating the parameters of the music generation model according to the loss value of the music generation model.
In a fifth aspect, an embodiment of the present application provides a terminal, including: a processor and a memory for storing a computer program, the processor being adapted to invoke and run the computer program stored in the memory to perform the method according to the first aspect as described above.
In a sixth aspect, an embodiment of the present application provides a training apparatus, including: a processor and a memory for storing a computer program, the processor being adapted to invoke and run the computer program stored in the memory to perform the method according to the second aspect as described above.
In a seventh aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program, the computer program causing a computer to perform the method according to the first or second aspect.
In an eighth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when executed by a processor, implements a method as described in the first or second aspect above.
According to the technical solutions provided by the embodiments of the present application, the music generation model takes the start symbol information as input and sequentially determines a plurality of pieces of symbol information of the musical composition, each including a plurality of musical attributes. The model calculates the attention information of the symbol information to be determined with a structured attention method, according to its position information in the musical composition and the determined symbol information and determined phrases located before it, and determines the symbol information from that attention information. The structured attention method determines the attention information according to the vectors of all non-phrase-ending symbol information of the strongly related phrases of the current phrase to which the symbol information to be determined belongs, the vectors of the phrase-ending symbol information of the weakly related phrases of the current phrase, and the vectors of the determined symbol information of the current phrase. Calculating the attention information with the structured attention method allows fine-grained music information of strongly related phrases to be attended to, which improves the similarity of repeated music and generates musical compositions with better repeatability, while the overall melody and the linkage between phrases are captured from the coarse-grained information of weakly related phrases, generating musical compositions with a more harmonious melody.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of a symbolic representation of a musical piece according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for generating a musical piece according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a relationship of phrase structure of a musical composition;
fig. 4 is a schematic structural diagram of a music generating model according to a second embodiment of the present application;
fig. 5 is a schematic diagram of another structure of a music generating model according to the second embodiment of the present application;
FIG. 6 is a schematic diagram of a determined token used in determining a token;
fig. 7 is a flowchart of a training method of a music generation model according to a third embodiment of the present application;
fig. 8 is a schematic structural diagram of a musical piece generating device according to a fourth embodiment of the present application;
fig. 9 is a schematic structural diagram of a training device for a music generating model according to a fifth embodiment of the present application;
Fig. 10 is a schematic structural diagram of a terminal according to a sixth embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above figures of the present application are used to distinguish similar objects and are not necessarily used to describe a particular sequential or chronological order. It should be understood that data so used may be interchanged where appropriate, so that the embodiments of the application described herein can be implemented in orders other than those illustrated or described herein. Furthermore, the terms "comprises", "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, so that a process, method, system, article or apparatus comprising a list of steps or elements is not necessarily limited to the steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article or apparatus.
The present application relates to artificial intelligence technology, wherein artificial intelligence is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is an integrated technology of computer science that attempts to understand the essence of intelligence and to produce a new intelligent machine that can react in a similar way to human intelligence. Artificial intelligence, i.e. research on design principles and implementation methods of various intelligent machines, enables the machines to have functions of sensing, reasoning and decision.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
In particular, natural language processing (Natural Language Processing, NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics; research in this field involves natural language, i.e. the language people use daily, and is therefore closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
In order to facilitate understanding of the embodiments of the present application, some concepts involved in the embodiments are explained below, before the various embodiments are described.
1. Musical compositions, also known as tracks or music, may refer broadly to any artistically or deliberately arranged sound that is pleasing to hear, including rhythms, melodies, sounds of different scales, and the like.
2. A phrase is a basic structural unit of a musical composition, composed of several bars; like a sentence in an article, it can express a relatively complete meaning, hence the name "phrase". A musical composition typically includes a plurality of phrases, each uniquely identified by a phrase identification, which can also be understood as a phrase name, such as phrase i (prelude/intro), phrase A, phrase B, phrase b (interlude), phrase o (ending).
3. Bars, the most basic regular rhythmic unit in a musical composition; each bar contains the number of beats indicated by the time signature.
4. Beats, the beats within a bar. Taking a 4/4 song as an example, a quarter note counts as one beat and each bar has 4 beats.
5. Notes (note), the symbols used to record tones of different lengths; each note carries information such as pitch (pitch), duration (duration), and position (at which beat of which bar).
6. Phrase structure, the structure of the phrases constituting a musical composition. For example, the phrase structure "i4A4B4b4A4B4o4" indicates that the musical composition consists of 7 phrases: a prelude i, phrase A, phrase B, an interlude b, phrase A, phrase B and an ending o, with 4 bars per phrase (a parsing sketch follows this concept list).
7. Chord (chord), a group of tones with certain interval relationships, i.e., three or more tones combined vertically according to a relationship of stacked thirds or non-thirds.
8. Chord progression (chord progression), a sequence of chords arranged in order, producing different emotional effects; it is the basis of harmony and an important creative element of a musical composition.
9. Track, a piece of music often contains multiple tracks that play simultaneously to form a composite listening experience; usually each instrument or sounding body is one track. For example, if a piece of music mixes vocals, piano and guitar, then the vocals, the piano and the guitar are each one track.
10. Repetition, in human-composed music certain phrases usually appear repeatedly to create a cyclic listening experience, laying groundwork and reinforcing memory. For example, in popular songs the verse and the chorus are repeated, and successive occurrences of phrase A or of phrase B have high similarity. The verse is the build-up part of a song and the chorus is its climax; repeated phrases are important to a piece of music and affect how the whole piece is heard.
11. Symbolic music generation, modeling the generation of symbolically represented music. Symbolization means extracting discrete musical elements such as pitch, rhythm and chords from a musical composition, treating the composition as a sequence composed of these elements, and modeling music generation as a sequence generation task.
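As an aside on the phrase-structure notation introduced in concept 6 above, the following minimal Python sketch parses such a string into (phrase identification, bar count) pairs. The function name and the list-of-pairs representation are illustrative assumptions, not part of the patent.

```python
import re

def parse_phrase_structure(structure: str):
    """Parse a phrase-structure string such as 'i4A4B4b4A4B4o4' into
    (phrase identification, bar count) pairs."""
    return [(m.group(1), int(m.group(2)))
            for m in re.finditer(r"([A-Za-z])(\d+)", structure)]

print(parse_phrase_structure("i4A4B4b4A4B4o4"))
# [('i', 4), ('A', 4), ('B', 4), ('b', 4), ('A', 4), ('B', 4), ('o', 4)]
```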
Existing symbolic music generation methods produce musical compositions whose repeated phrases have low similarity, which does not match the characteristics of human-composed musical works. On this basis, the embodiments of the present application provide a musical composition generation method whose generated compositions have highly similar repeated phrases; the embodiments of the present application also provide a symbolic representation of music.
The embodiments of the present application provide a symbolic music representation in which a plurality of musical attributes (also called musical elements) of a note are grouped into one piece of symbol information, which can be understood as a combination of those musical attributes. The symbol information may be a token. In NLP, a token is a basic unit of text and may be a word, a phrase, a punctuation mark, a character, etc.; here, following the requirements and methods of text processing, a token is a combined word formed by combining a plurality of musical attributes. The following embodiments therefore take the token as an example, and "token" and "symbol information" are interchangeable. Table 1 describes the musical attributes defined in the embodiments of the present application:
Table 1

Family: symbol information (token) category, one of BOS, EOS, BAR, POS, NOTE, CLS
Phrase: identification of the phrase to which the token belongs, e.g. i, A, B, b, o
BarCountdown: number of remaining bars of the phrase
Beat: beat identification within the bar, 1-16 at 1/16-note granularity
Chord: chord
Track: track to which the note belongs
Pitch: pitch of the note
Duration: duration of the note
Referring to Table 1, in order to incorporate phrase structure into the symbolic representation of music, two phrase-related musical attributes are introduced: Phrase and BarCountdown. Phrase represents the identification of the phrase to which the token belongs; the phrase identification may be i, A, B, b, o, used to distinguish different phrases. This embodiment does not limit the specific form of the phrase identification, as long as different phrases can be distinguished.
BarCountdown indicates the number of remaining bars in a phrase, i.e. the relative position of the bar within the phrase. For a 4-bar phrase, BarCountdown takes the value 4 in the first bar, 3 in the second bar, 2 in the third bar, and 1 in the fourth bar.
For the current token, "Phrase = A, BarCountdown = 8" indicates that the current token belongs to phrase A and that 8 bars remain until phrase A ends.
Beat represents the beat identification within a bar at 1/16-note granularity; each bar has 16 positions, so the value lies between 1 and 16.
Illustratively, assume the phrase structure of a piece of music is A2B2A2. Fig. 1 is a schematic diagram of the symbolic representation of a musical composition according to an embodiment of the present application; as shown in Fig. 1, the musical composition includes three phrases A, B, A, each consisting of two bars. In Fig. 1, each column represents one token, each token includes 8 musical attributes, and a blank box in the figure indicates that the value of that musical attribute in the token is null.
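To make the token layout of Fig. 1 concrete, here is a hypothetical Python rendering of one token as a record of the 8 musical attributes of Table 1. The field types and the example values are assumptions for illustration, not the patent's actual encoding.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Token:
    """One piece of symbol information; None models the blank boxes in Fig. 1."""
    family: str                          # token category: BOS/EOS/BAR/POS/NOTE/CLS
    phrase: Optional[str] = None         # phrase identification, e.g. 'A'
    bar_countdown: Optional[int] = None  # number of remaining bars of the phrase
    beat: Optional[int] = None           # beat identification within the bar, 1..16
    chord: Optional[str] = None          # chord name (hypothetical string form)
    track: Optional[str] = None          # track to which the note belongs
    pitch: Optional[int] = None          # pitch of the note (e.g. a MIDI number)
    duration: Optional[int] = None       # duration of the note, e.g. in 1/16 steps

# e.g. a NOTE token in the first bar of phrase A of the A2B2A2 piece of Fig. 1:
note_tok = Token(family="NOTE", phrase="A", bar_countdown=2,
                 beat=1, track="piano", pitch=60, duration=4)
```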
Based on the symbolic music representation provided by the embodiments of the present application, the embodiments of the present application provide a musical composition generation method that can generate musical compositions with good phrase repeatability; that is, the method focuses on phrase-level repetition.
The musical composition generation method can be applied to a terminal, which may be any of various devices such as a mobile phone, a tablet computer, a desktop computer, a portable notebook computer, an intelligent voice interaction device or a smart home appliance. The terminal is provided with an application, which may be, for example, conventional application software, cloud application software, an applet or application module within a host application, or a web platform, without limitation here.
Alternatively, the above-mentioned application may be an audio application, a content interaction application, a short video application, a game application, an e-commerce application, an instant messaging application, or the like, without being particularly limited thereto.
By way of example, in some audio applications, such as music software, the musical composition generation method provided by the embodiments of the present application can assist users in efficiently creating high-quality musical compositions. As another example, in a man-machine dialogue system or an intelligent robot, a musical composition can be generated and played according to the user's needs.
Having introduced some concepts related to the embodiments of the present application, a specific description will be given below of a method for generating a musical piece and a training method for a music generating model according to the embodiments of the present application with reference to the accompanying drawings.
Fig. 2 is a flowchart of a method for generating a musical piece according to an embodiment of the present application, where the method may be performed by a terminal, and as shown in fig. 2, the method of the present embodiment includes the following steps.
S101, generating initial symbol information of the musical piece according to a generation instruction of the musical piece.
The generation instruction is used to trigger the generation of the musical composition. The generation instruction may be input by the user by voice; for example, in an intelligent-robot scenario the user interacts with the robot by voice and may instruct the robot to generate a new song. The generation instruction may also be input by the user through a start control on a musical composition generation page; for example, the user opens the generation page, which includes the start control and some content related to music generation.
The terminal generates the start symbol information according to the generation instruction; the start symbol information serves as the input of the music generation model and may be a BOS token, understood as the start marker of the symbol sequence corresponding to the musical composition.
S102, inputting the start symbol information into the music generation model and sequentially determining a plurality of tokens of the musical composition, each token including a plurality of musical attributes, where the music generation model is used to calculate the attention information of the token to be determined with a structured attention method, according to the position information of the token to be determined in the musical composition and the determined tokens and determined phrases located before the token to be determined, and to determine the token to be determined according to its attention information.
The music generation model is a neural network model that uses autoregressive decoding: the model fuses all determined tokens to predict the next token, so the later a token lies in the sequence, the more determined tokens must be considered when predicting it; to some extent, the model can be understood as predicting from its own output. A determined token is a token of the musical composition whose prediction has already been completed.
Illustratively, one token includes the following musical attributes: token category (family in table one), phrase identification (Phrase in table one), remaining number of bars of Phrase (BarCountdown in table one), beat identification within bars (Beat in table one), chord (Chord in table one), track to which the note belongs (Track in table one), pitch of the note (Pitch in table one), and Duration of the note (Duration in table one). Illustratively, the token categories include: BOS, EOS, BAR, POS, NOTE and CLS.
When humans compose music, to achieve musical repeatability they pay close attention to repetitive phrases (which are imitated and kept highly similar), but pay much less attention to other phrases, which only need to satisfy the harmony of the global melody. Therefore, in the embodiments of the present application, when a phrase is generated, the repetitive phrases preceding it are defined as strongly related phrases, and the other preceding phrases as weakly related phrases. The structured attention mechanism is then used to attend to the fine-grained music information of strongly related phrases, while the coarse-grained music information of weakly related phrases is used to capture the trend of the global melody and the linkage between phrases. In this way, musical compositions with good repeatability can be generated, where good repeatability means high similarity between repeated phrases.
Fig. 3 is a schematic diagram of the relationship between the phrase structures of a musical composition. As shown in Fig. 3, the musical composition has 7 paragraphs in total, in order: Intro (prelude), Verse, Chorus, Bridge, Verse, Chorus and Outro (ending). The musical composition has 11 phrases in total, in order: iAABBbAABBo. For the 3rd phrase A in the figure (i.e., the 7th phrase of the musical composition), the preceding phrases are iAABBb, where the two phrases A are strongly related phrases of this phrase and the other phrases iBBb are weakly related phrases of this phrase.
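A minimal sketch of how the strongly and weakly related phrases of Fig. 3 might be identified, assuming phrases are compared purely by their identification, as the text describes; the function name and return convention are illustrative.

```python
def split_related_phrases(determined, current):
    """Indices of strongly related phrases (same identification as the current
    phrase) and weakly related phrases (different identification) among the
    already determined phrases."""
    strong = [i for i, p in enumerate(determined) if p == current]
    weak = [i for i, p in enumerate(determined) if p != current]
    return strong, weak

# The 7th phrase of 'iAABBbAABBo' is an 'A'; the six phrases before it are 'iAABBb':
print(split_related_phrases(list("iAABBb"), "A"))  # ([1, 2], [0, 3, 4, 5])
```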
In this embodiment, the attention information of the token to be determined is calculated using a structured attention method, which is a variant of the self-attention mechanism; self-attention lets the model notice the correlations between different parts of the whole input. The self-attention mechanism computes a contextual representation of a query vector (i.e., $q$) against a key vector (i.e., $k$) and a value vector (i.e., $v$), where $k$ and $v$ are the same; the attention information can be computed by equation (1):

$\mathrm{Attention}(q, K, V) = \mathrm{softmax}\!\left(\frac{(q W^{Q})(K W^{K})^{\top}}{\sqrt{d}}\right) V W^{V} \qquad (1)$

where $W^{Q}$, $W^{K}$ and $W^{V}$ represent the linear mappings, respectively, and $d$ is the vector dimension.
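For concreteness, here is a NumPy sketch of equation (1) for a single query vector, assuming the learned linear mappings are given as plain matrices; this is the generic self-attention computation, not yet the structured variant described below.

```python
import numpy as np

def attention(q, K, V, Wq, Wk, Wv):
    """Scaled dot-product attention of one query against keys/values,
    following equation (1); Wq, Wk, Wv are the linear mappings."""
    qh, Kh, Vh = q @ Wq, K @ Wk, V @ Wv         # linear projections
    scores = qh @ Kh.T / np.sqrt(qh.shape[-1])  # similarity of q to every key
    weights = np.exp(scores - scores.max())     # numerically stable softmax
    weights /= weights.sum()
    return weights @ Vh                         # contextual representation

d = 8
rng = np.random.default_rng(0)
q, K = rng.normal(size=(d,)), rng.normal(size=(5, d))
maps = [rng.normal(size=(d, d)) for _ in range(3)]
out = attention(q, K, K, *maps)                 # k and v are the same, as stated
```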
The structured attention method distinguishes the determined phrases according to their correlation with the current phrase, dividing them into strongly correlated phrases and weakly correlated phrases. When the attention information is calculated, the music information attended to in strongly correlated and weakly correlated phrases differs, so music with better repeatability can be generated.
Exemplarily, the structured attention method includes: determining the attention information of the token to be determined according to the vectors of all non-phrase-ending tokens (i.e., all tokens other than the CLS) of the strongly related phrases of the current phrase to which the token belongs, the vectors of the CLS tokens of the weakly related phrases of the current phrase, and the vectors of the determined tokens of the current phrase. The strongly related and weakly related phrases of the current phrase belong to the determined phrases, and the CLS is the last token of each phrase, representing the music information of the whole phrase.
The input and output of the structured attention method are both vectors, so each determined token needs to be converted into a corresponding token vector before the structured attention is calculated.
The position information of the token to be determined in the musical piece is used to represent the relative position information of the token to be determined in the whole musical piece, for example, the position of the current phrase to which the token to be determined belongs and the position of the token to be determined in the current phrase, and the position of the token to be determined in the current phrase may be the number of the token to be determined in the current phrase or the position relationship between the token to be determined and the adjacent token. The location information of the token to be determined in the musical piece may be determined based on the determined phrase and the determined token.
The music generation model determines a strong correlation phrase and a weak correlation phrase of the current phrase from the determined phrases according to the identification of the current phrase to which the token to be determined belongs and the identification of the determined phrases in the musical composition, wherein the strong correlation phrase is the determined phrase which is the same as the identification of the current phrase, and the weak correlation phrase is the determined phrase which is different from the identification of the current phrase.
For strongly related phrases, the vectors of all non-CLS tokens (i.e., all tokens except the CLS) are used to calculate the attention information, so that fine-grained music information of the strongly related phrases is attended to; this improves the similarity of repeated music and produces musical compositions with better repeatability. For weakly related phrases, the vector of the CLS token is used to calculate the attention information, so that coarse-grained music information of the weakly related phrases is attended to; the global melody and the linkage between phrases are captured from the weakly related phrase information, producing musical compositions with a more harmonious and graceful melody.
It will be appreciated that for the first phrase in the musical piece, where there are no strongly and weakly correlated phrases, the values of the strongly and weakly correlated phrases of the first phrase may be considered to be null, and then the attention information of the token to be determined is calculated from the vector of the determined token of the current phrase and the location information of the token to be determined in the musical piece. For some phrases in a musical piece, there may be only strongly related phrases, only weakly related phrases, and both strongly and weakly related phrases.
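Putting the selection rules above together, the following sketch (reusing the hypothetical Token record from earlier) assembles the set of determined tokens that the token to be determined may attend to. The patent describes this selection abstractly, so the data layout here is an assumption.

```python
def structured_context(determined_phrases, current_tokens, current_id):
    """Build the attention context for the next token. `determined_phrases` is
    a list of (phrase identification, tokens) pairs for the determined phrases;
    `current_tokens` are the tokens already determined in the current phrase.
    Strongly related phrases contribute all their non-CLS tokens (fine-grained
    music information); weakly related phrases contribute only their CLS token
    (a coarse-grained summary of the whole phrase)."""
    context = []
    for phrase_id, toks in determined_phrases:
        if phrase_id == current_id:   # strongly related phrase
            context += [t for t in toks if t.family != "CLS"]
        else:                         # weakly related phrase
            context += [t for t in toks if t.family == "CLS"]
    context += current_tokens         # determined tokens of the current phrase
    return context                    # empty for the first tokens of a piece
```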
S103, generating a music score of the musical composition according to the token of the musical composition.
The music generation model sequentially determines the token in the music work, and when the ending condition is met, the prediction of the token is stopped. Each token predicted by the music generation model forms a symbol sequence of the musical composition, and the symbol sequence is converted into a music score of the musical composition.
When the symbol sequence of a musical composition is converted into a score, each token of the musical composition is converted into music information, and the music information is filled into Musical Instrument Digital Interface (MIDI) messages. MIDI is a very widely used music expression format that can be called a "score a computer can understand"; a MIDI message is a kind of timing information for expressing and controlling music. For example, a BAR token corresponds to the beginning of a bar, a POS token to the position of a chord or note, and a NOTE token to the pitch of a note; the tokens are converted into music information one by one using the muspy toolkit or a MIDI toolkit to fill the MIDI file.
When converting the token sequence of a musical composition into a score, a token does not correspond one-to-one to a note. For example, a token whose Family is NOTE describes information such as the pitch, duration, track and phrase of a note; a token whose Family is POS describes the beat and chord information of a note; and BAR and CLS tokens describe the beginning of a bar and the end of a phrase respectively, without directly corresponding to a note. The specific conversion process is an existing technique and is not described in detail here.
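A hedged sketch of the token-to-score step using the mido library as a stand-in for the muspy/MIDI toolkits mentioned above. The `beat_abs` attribute (an absolute 1/16-note onset per NOTE token) is a hypothetical simplification; a real converter would derive note positions from the BAR and POS tokens as described in the text.

```python
import mido

def tokens_to_midi(tokens, path, ticks_per_16th=120):
    """Render NOTE tokens to a single-track MIDI file."""
    mid = mido.MidiFile()
    track = mido.MidiTrack()
    mid.tracks.append(track)
    events = []
    for tok in tokens:
        if tok.family != "NOTE":
            continue
        on = tok.beat_abs * ticks_per_16th
        off = on + tok.duration * ticks_per_16th
        events += [(on, "note_on", tok.pitch), (off, "note_off", tok.pitch)]
    last = 0
    for tick, kind, pitch in sorted(events):
        # MIDI stores delta times relative to the previous event
        track.append(mido.Message(kind, note=pitch, velocity=64, time=tick - last))
        last = tick
    mid.save(path)
```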
According to the method of this embodiment, a start token of a musical composition is generated according to a generation instruction and input into a music generation model, which sequentially determines the tokens of the composition, each token including a plurality of musical attributes. The model calculates the attention information of the token to be determined with a structured attention method, according to the position information of the token to be determined in the composition and the determined tokens and determined phrases located before it, and determines the token from that attention information. The structured attention method determines the attention information according to the vectors of all non-CLS tokens of the strongly related phrases of the current phrase to which the token belongs, the vectors of the CLS tokens of the weakly related phrases of the current phrase, and the vectors of the determined tokens of the current phrase. Fine-grained music information of strongly related phrases can thus be attended to, improving the similarity of repeated music and generating musical compositions with good repeatability, while the global melody and the linkage between phrases are captured from the coarse-grained information of weakly related phrases, generating musical compositions with a more harmonious melody.
On the basis of the first embodiment, as an alternative implementation, the music generation model includes a music representation model, an autoregressive decoding network and an output module. Fig. 4 is a schematic structural diagram of the music generation model according to a second embodiment of the present application. As shown in Fig. 4, the two inputs of the autoregressive decoding network are the vector of the token output by the music representation model and the position code of the token to be determined, where the position code is obtained by encoding the position information of the token to be determined in the musical composition. The autoregressive decoding network calculates the attention information of the token to be determined with the structured attention method and outputs the decoding vector of the token to be determined; the output module processes the decoding vector to obtain the token to be determined.
The autoregressive decoding network comprises L decoders, L is greater than or equal to 1, each decoder calculates attention information of a token to be determined by adopting a structured attention method, and the L decoders run in series.
The inputs of the first decoder are the vector of the current token and the position code of the token to be determined, which is obtained by encoding the position information of the token to be determined in the musical composition. The input of the first decoder is the superposition of the vector of the current token and the position code of the token to be determined, so the position code of the token to be determined is a vector of the same dimension d as the vector of the current token.
The input of each remaining decoder is the output of the adjacent preceding decoder, and the output of the last decoder is the decoding vector of the token to be determined.
Alternatively, the autoregressive decoding network may be a recurrent neural network (Recurrent Neural Network, RNN) model, a convolutional neural network (Convolutional Neural Network, CNN) model, or a Transformer model. The decoder structure and the position-encoding method may differ between networks; this embodiment may encode the position information of the token to be determined with an existing encoding method.
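The patent leaves the position-encoding method open. As one common choice for Transformer-style decoders, here is a sketch of the sinusoidal encoding from "Attention Is All You Need", which produces a code of the same dimension d as the token vector so the two can be superposed as described above.

```python
import numpy as np

def positional_encoding(position: int, d: int) -> np.ndarray:
    """Sinusoidal position code of dimension d (d assumed even)."""
    i = np.arange(d // 2)
    angles = position / np.power(10000.0, 2 * i / d)
    code = np.empty(d)
    code[0::2] = np.sin(angles)   # even dimensions: sine
    code[1::2] = np.cos(angles)   # odd dimensions: cosine
    return code

d = 8
token_vector = np.zeros(d)                                 # stand-in token vector
decoder_input = token_vector + positional_encoding(3, d)   # superposition
```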
Fig. 5 is a schematic diagram of another structure of a music generating model according to a second embodiment of the present application, where, as shown in fig. 5, an autoregressive decoding network is composed of L decoders, each decoder includes a structured attention module and a feedforward network, where the structured attention module is used for calculating attention information of a token to be determined by a structured attention method, and the feedforward network is used for processing the attention information of the token to be determined to obtain a decoding vector of the token to be determined, and processing of the feedforward network includes, but is not limited to, nonlinear transformation processing. The output module comprises a linear layer and a softmax layer, and the decoded vector of the token to be determined, which is output by the autoregressive decoding network, sequentially passes through the linear layer and the softmax layer to be processed, so that the token to be determined is obtained.
The data processing procedure of the music representation model, the autoregressive decoding network and the output module is described in detail below.
(I) The music representation model is used for converting the current token newly generated by the music generation model into the vector of the current token.
Illustratively, the value of each music attribute of the current token is converted into a vector corresponding to that attribute; the vectors corresponding to all the music attributes of the current token are combined to form a first vector; and the first vector is mapped through a linear layer to a preset dimension, obtaining the vector of the current token.
Alternatively, the music representation model may use a word embedding (embedding) manner to convert the value of each music attribute included in the current token into a vector corresponding to each music attribute.
In natural language processing, word embedding is a representation method that maps words or other text units into a continuous vector space; in this embodiment, the value of each music attribute of a token is converted into a vector through word embedding.
Alternatively, the first vector may be converted to the vector of the current token by a multi-layer perceptron (Multilayer Perceptron, MLP).
The above conversion of a token into a token vector may be expressed by formulas. Assume that the musical piece $X$ contains $M$ phrases, so that $X = \{P_1, P_2, \dots, P_M\}$, and that the $i$-th phrase is composed of $N_i$ tokens, $P_i = \{x_{i,1}, x_{i,2}, \dots, x_{i,N_i}\}$. The vector $v_{i,j}$ of each token $x_{i,j}$ and the vector $V_i$ of the entire phrase can be represented by formulas (2) and (3):

$$v_{i,j} = \mathrm{MLP}\Big(\mathop{\mathrm{Concat}}_{e \in E}\ \mathrm{Emb}\big(x_{i,j}^{e}\big)\Big) \in \mathbb{R}^{d} \tag{2}$$

$$V_i = \big\{v_{i,1}, v_{i,2}, \dots, v_{i,N_i}\big\} \tag{3}$$

where $x_{i,j}$ denotes the $j$-th token in the $i$-th phrase, $E$ denotes the set of music attributes, $\mathrm{Emb}$ denotes the word embedding process, $\mathrm{Concat}$ denotes vector concatenation, $\mathrm{MLP}$ denotes a multi-layer perceptron, $d$ denotes the vector dimension of a token, and $\mathbb{R}^{d}$ denotes the vector space of the token.
Illustratively, the vector after the word embedding process is a 512-dimensional vector, the input of the MLP is an 8 x 512-dimensional vector, and the vector after the MLP process is changed into a d-dimensional vector.
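A minimal sketch of formula (2), assuming 8 music attributes with illustrative vocabulary sizes (all class and parameter names are assumptions made for illustration):

```python
import torch
import torch.nn as nn

class MusicRepresentation(nn.Module):
    """Sketch of formula (2): embed the value of each music attribute,
    concatenate the attribute vectors into the first vector, then project
    to d dimensions with an MLP."""
    def __init__(self, vocab_sizes: list[int], emb_dim: int = 512, d: int = 512):
        super().__init__()
        # one embedding table per music attribute
        self.embeddings = nn.ModuleList(nn.Embedding(v, emb_dim) for v in vocab_sizes)
        self.mlp = nn.Sequential(
            nn.Linear(len(vocab_sizes) * emb_dim, d), nn.ReLU(), nn.Linear(d, d))

    def forward(self, attribute_values: torch.Tensor) -> torch.Tensor:
        # attribute_values: one integer per music attribute of the current token
        vecs = [emb(v) for emb, v in zip(self.embeddings, attribute_values)]
        first_vector = torch.cat(vecs, dim=-1)   # e.g. 8 x 512 = 4096 dimensions
        return self.mlp(first_vector)            # vector of the current token, in R^d
```

For instance, a model for 8 attributes could be built as `MusicRepresentation([6, 64, 32, 64, 128, 16, 128, 64])` (hypothetical vocabulary sizes), matching the 8 × 512 concatenation mentioned above.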
(II) The autoregressive decoding network takes the vector of the current token and the position code of the token to be determined as inputs, calculates the attention information of the token to be determined by the structured attention method, and determines the decoding vector of the token to be determined according to its attention information; the token to be determined is the next token after the current token, and the position code is obtained by encoding the position information of the token to be determined in the musical piece.
When the token to be determined is the non-last token of the current phrase, determining the attention information of the token to be determined according to all non-CLS vectors of the strongly related phrase of the current phrase, the CLS vectors of the weakly related phrase of the current phrase and the vectors of the determined token in the current phrase. When the token to be determined is the last token of the current phrase, determining the attention information of the token to be determined according to the vector of the token of the current phrase.
A phrase includes a plurality of tokens, and the token categories within a phrase generally include BAR, POS, NOTE and CLS, where the CLS is the last token of the phrase. In this embodiment, the method for calculating the attention information of the CLS differs from the method for calculating the attention information of the other tokens.
When the token to be determined is not the last token of the current phrase, i.e. when the token to be determined is a non-CLS token, the attention information $a_{i,j}$ of the $j$-th token to be determined of phrase $i$ can be represented by the following formula (4):

$$a_{i,j} = \mathrm{Attn}\Big(v_{i,j},\ S_i^{\mathrm{strong}} \cup S_i^{\mathrm{weak}} \cup S_{i,j}^{\mathrm{cur}}\Big), \quad 1 \le j < N_i \tag{4}$$

where $N_i$ denotes the total number of tokens in the current phrase, $S_i^{\mathrm{strong}}$ denotes the aggregation of the vectors of all non-CLS tokens in the strongly related phrase of the current phrase, $S_i^{\mathrm{weak}}$ denotes the aggregation of the CLS vectors of the weakly related phrases of the current phrase, and $S_{i,j}^{\mathrm{cur}}$ denotes the aggregation of the vectors of the determined tokens in the current phrase.

Assume that the identity of the strongly related phrase of the current phrase is $s$; then $S_i^{\mathrm{strong}}$, $S_i^{\mathrm{weak}}$ and $S_{i,j}^{\mathrm{cur}}$ can be represented by the following formulas (5)-(7):

$$S_i^{\mathrm{strong}} = \big\{v_{s,k} \mid 1 \le k \le N_s - 1\big\} \tag{5}$$

$$S_i^{\mathrm{weak}} = \big\{v_{w,N_w} \mid w \text{ is a weakly related phrase of phrase } i\big\} \tag{6}$$

$$S_{i,j}^{\mathrm{cur}} = \big\{v_{i,k} \mid 1 \le k \le j\big\} \tag{7}$$

where $N_s$ denotes the total number of tokens of the strongly related phrase and $s$ denotes the identity of the strongly related phrase. In formula (7), $k$ smaller than $j$ means that the determined tokens in the current phrase are used when calculating $S_{i,j}^{\mathrm{cur}}$, while $k$ equal to $j$ means that the token to be predicted itself is used, where "the token to be predicted itself" is to be understood as the position information of the token to be predicted in the musical piece.

When the token to be determined is the last token of the current phrase, i.e. when the token to be determined is the CLS, its attention information $a_{i,N_i}$ can be represented by the following formula (8):

$$a_{i,N_i} = \mathrm{Attn}\big(v_{i,N_i},\ V_i\big) \tag{8}$$

where $V_i$, the vector of the $i$-th phrase (i.e. the current phrase), is composed of the vectors of all tokens of the $i$-th phrase.
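A sketch of how the attention context of formulas (4)-(8) could be assembled as sets of token indices (representing each phrase as an assumed (start, end) pair of global indices, with the CLS sitting at `end`, is an illustration choice, not the application's data structure):

```python
def non_cls_context(j, current_phrase, strong_phrases, weak_phrases):
    """Global token indices a non-CLS query at position j of the current
    phrase may attend to, following formulas (4)-(7)."""
    context = []
    for start, end in strong_phrases:      # formula (5): all non-CLS tokens
        context.extend(range(start, end))  # excludes the CLS at index `end`
    for _, end in weak_phrases:            # formula (6): only the CLS vector
        context.append(end)
    start, _ = current_phrase              # formula (7): k <= j
    context.extend(range(start, start + j + 1))  # k = j is the query's own position
    return context

def cls_context(current_phrase):
    """Formula (8): the CLS attends to every token of the current phrase."""
    start, end = current_phrase
    return list(range(start, end + 1))
```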
FIG. 6 is a schematic diagram of the determined tokens used in determining a token. As shown in FIG. 6, when determining each token in the third phrase (i.e. the second occurrence of phrase A), the first phrase (phrase A) is a strongly related phrase of the third phrase and the second phrase (phrase B) is a weakly related phrase of the third phrase. The direction indicated by the arrows in the figure is the direction of information flow, i.e. each token shown above aggregates the tokens shown below it.
Take the first token of the third phrase, i.e. the BAR of the third phrase, as an example. In predicting this BAR, all tokens of the first phrase except its CLS (the tokens connected by solid lines), the CLS of the second phrase (the token connected by the first type of dashed line), and the first token of the third phrase (the token connected by the second type of dashed line) are used: the solid lines represent the non-CLS tokens of the strongly related phrase, the first type of dashed line represents the CLS of the weakly related phrase, and the second type of dashed line represents the determined tokens of the current phrase. It should be understood that the BAR of the third phrase has not yet been predicted at this time, but its position information is known; therefore, the position information of the BAR of the third phrase is also used in determining the BAR of the third phrase.
Take the last token of the third phrase, i.e. the CLS of the third phrase, as an example: all tokens of the third phrase (the current phrase) are used in predicting this CLS.
(III) The output module is used for determining the token to be determined according to the decoding vector of the token to be determined, and taking the token to be determined as the next input of the music representation model.
In the embodiment of the application, two decoding methods are supported: unconditional decoding and conditional decoding. In unconditional decoding, the user does not need to input any information; in conditional decoding, the user needs to input the generation parameters of the musical piece, which include at least one of the following: the phrase structure of the musical piece and the chord progression information of the musical piece, the chord progression information including the positions and names of a plurality of chords.
In conditional decoding, the user designates the generation parameters, so that a musical piece meeting the user's expectations can be generated. Before the terminal generates the BOS of the musical piece according to the generating instruction, it receives the generation parameters of the musical piece input by the user and takes them as an input of the music generation model; the music generation model then determines the next token step by step according to the BOS and the generation parameters, and stops predicting tokens once an ending condition is detected.
In one implementation, the output module determines a predicted value of the token to be determined according to the decoding vector of the token to be determined, and modifies the predicted value of the token to be determined according to the generation parameters of the musical composition to obtain the token to be determined, wherein the predicted value of the token to be determined includes predicted values of all the musical attributes of the token to be determined.
For example, when the generation parameters include the phrase structure of the musical piece and the phrase structure is i4A4B4A4B4o4, then upon determining that the token to be determined is the first token of a new phrase, a new phrase is established, the identity of the new phrase and the number of bars of the phrase are determined according to the phrase structure, and the predicted values of the phrase identity and the remaining number of bars in the predicted value of the token to be determined are modified accordingly, i.e. the values of the phrase identity and the remaining number of bars are changed from their predicted values to the identity and the number of bars of the new phrase.
In this embodiment, whether the token to be determined is the first token of a new phrase may be determined according to the previous token adjacent to it, i.e. the category of the current token: if the category of the current token is BOS or CLS, the next token is the first token of a new phrase. It may also be necessary to determine, according to the phrase structure, whether the number of determined phrases has reached the total number of phrases of the musical piece: if not, a new phrase continues to be generated; if so, no new phrase is generated.
When the token to be determined is the first token of a new phrase, whether the new phrase is a repetitive phrase can be judged according to the phrase structure and the position of the new phrase. If the new phrase is a repetitive phrase, its identity is determined according to the identity of the already generated phrase that it repeats, since repetitive phrases share the same identity. If the new phrase is not a repetitive phrase, an identity is generated for it that differs from the identities of the already generated phrases.
If the token to be determined is not the first token of the new phrase, then in one implementation its predicted value is not modified; in another implementation, its predicted value is modified according to the first token of the new phrase, so that the phrase identities of all tokens in the new phrase are the same and the remaining number of bars of the phrase is correct.
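A sketch of how the identity and bar count of a new phrase could be resolved from the phrase structure (the parsing convention for strings such as i4A4B4A4B4o4 — one letter per phrase followed by its bar count — is an assumption):

```python
import re

def next_phrase_info(phrase_structure: str, phrase_index: int, known_ids: dict):
    """Resolve the identity and bar count of the phrase at `phrase_index`
    from a structure string such as 'i4A4B4A4B4o4'."""
    phrases = re.findall(r"([A-Za-z])(\d+)", phrase_structure)
    if phrase_index >= len(phrases):
        return None                      # total number of phrases reached
    label, bars = phrases[phrase_index]
    if label not in known_ids:           # repetitive phrases reuse an identity
        known_ids[label] = len(known_ids)
    return known_ids[label], int(bars)
```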
When the generation parameters also include the chord progression information of the musical piece, the output module determines that the token to be determined is the starting position of a chord according to the positions of the chords in the chord progression information and the position of the token to be determined in the current phrase, and modifies the predicted value of the chord-or-note starting position in the predicted value of the token to be determined according to the name of that chord in the chord progression information, i.e. the value of the chord or note starting position is changed from the predicted value to the name of the chord designated by the user.
Exemplarily, the chord progression information of a musical piece is: Chord(name=C:maj, bar=1, beat=1/16), Chord(name=A:min, bar=5, beat=1/16). The meaning of this chord progression information is: a chord named C:maj is inserted at the position of the 1st 16th note of the first bar of the phrase (i.e. bar=1), and a chord named A:min is inserted at the position of the 1st 16th note of the fifth bar of the phrase.
The chord positions are designated in the chord progression information. It is judged whether a chord position in the chord progression information matches the position of the token to be determined in the current phrase; if so, the token to be determined is determined to be the starting position of a chord, which can also be understood as determining that the category of the token to be determined is POS. After the token to be determined is determined to be the starting position of a chord, or its category is determined to be POS, the value of the chord-or-note starting position in its predicted value is modified to the name of the matched chord.
Taking Chord(name=C:maj, bar=1, beat=1/16), Chord(name=A:min, bar=5, beat=1/16) as an example, it is determined whether the position of the token in the current phrase is the position of the 1st 16th note of the first bar of the phrase, or the position of the 1st 16th note of the fifth bar. If the position of the token in the current phrase matches the position of the 1st 16th note of the fifth bar, the value of the chord-or-note starting position in the predicted value of the token to be determined is set to A:min.
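A sketch of this modification step (the attribute dictionary and field names are assumptions; only the matching-and-overwrite logic follows the description above):

```python
def apply_chord_progression(predicted, position_in_phrase, chord_progression):
    """If the position of the token to be determined matches a user-specified
    chord position, force the token category to POS and overwrite the
    chord-or-note-start value with the designated chord name."""
    for chord in chord_progression:
        if position_in_phrase == (chord["bar"], chord["beat"]):
            predicted["family"] = "POS"         # starting position of a chord
            predicted["chord"] = chord["name"]  # e.g. 'A:min' at bar 5, beat 1/16
            break
    return predicted

progression = [{"name": "C:maj", "bar": 1, "beat": "1/16"},
               {"name": "A:min", "bar": 5, "beat": "1/16"}]
```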
Optionally, the output module includes a linear layer and a softmax layer, and determining the token to be determined according to the decoding vector of the token to be determined may include the following steps: mapping the decoding vector of the token to be determined into output vectors of all music attributes of the token to be determined through the linear layer; and processing the output vector of each music attribute of the token to be determined through the softmax layer to obtain probability distribution of each music attribute of the token to be determined, determining predicted values of each music attribute of the token to be determined according to the probability distribution of each music attribute of the token to be determined, and forming the predicted values of the token to be determined by the predicted values of each music attribute of the token to be determined.
The token to be determined comprises a plurality of music attributes; when determining the predicted values of these music attributes, no particular order among the attributes is required.
In one implementation, the value of the token category (i.e. the family) is determined first, and then the values of the other music attributes are determined, which improves decoding efficiency. The reason is that tokens contain music attributes whose values are null, the null-valued attributes differ between token categories, and the values of null-valued attributes do not need to be determined; therefore, by determining the token category of the token to be determined first, the null-valued music attributes can be filtered out, and only the values of the remaining music attributes need to be determined.
Specifically, the output vector of the token category of the token to be determined is processed through the softmax layer to obtain the probability distribution of the token category, and the predicted value of the token category is determined according to this probability distribution. The probability distribution of the token category is the probability of the token category taking each possible value; for example, if the token category has 6 possible values, the probability distribution consists of the probabilities of these 6 values, which sum to 1, and the value with the largest probability is the predicted value of the token category.
After the predicted value of the token category of the token to be determined is obtained, determining a first type of music attribute with a null value and a second type of music attribute with a non-null value in the remaining music attributes of the token to be determined according to the predicted value of the token category of the token to be determined, wherein the first type of music attribute and the second type of music attribute included in different token categories are different. And processing the output vector of the second type of music attribute of the token to be determined through a softmax layer to obtain probability distribution of the second type of music attribute, and determining a predicted value of the second type of music attribute according to the probability distribution of the second type of music attribute. And for the first type of music attribute, processing is not needed, and the value of the first type of music attribute is set to be null.
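A sketch of this two-stage decoding, with the token category determined first and the null-valued attributes skipped (the per-category attribute table is illustrative, not the application's exact null/non-null scheme):

```python
import torch

FAMILY_NAMES = ["BOS", "EOS", "BAR", "POS", "NOTE", "CLS"]

# Second-type (non-null) attributes per token category — an assumed table.
NON_NULL = {
    "BAR":  ["phrase_id", "bars_left"],
    "POS":  ["phrase_id", "bars_left", "beat", "chord"],
    "NOTE": ["phrase_id", "bars_left", "beat", "track", "pitch", "duration"],
}

def decode_token(attribute_logits: dict) -> dict:
    """Determine the token category first, then decode only the attributes
    that are non-null for that category; null attributes are skipped."""
    family_probs = torch.softmax(attribute_logits["family"], dim=-1)
    family = FAMILY_NAMES[int(family_probs.argmax())]
    token = {"family": family}
    for attr in NON_NULL.get(family, []):
        probs = torch.softmax(attribute_logits[attr], dim=-1)
        token[attr] = int(probs.argmax())   # most probable value as prediction
    return token
```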
In the embodiment of the application, the generation of the musical piece is triggered by the BOS, and it must also be determined when the musical piece ends. In one implementation, the music generation model is further configured to judge whether the current token is a music end token (i.e. EOS): if the current token is EOS, it determines to end the generation of the musical piece and generates the musical piece according to the determined tokens; if the current token is not EOS, the current token is taken as the input of the music representation model to continue predicting the next token.
The EOS is generated based on the phrase structure of the musical piece or the maximum number of tokens of the musical piece; the latter, also referred to as the maximum sequence length of the musical piece, reflects the length of the musical piece.
When the determined number of tokens of the musical piece reaches the maximum number of tokens of the musical piece, the EOS is generated, and the music generation model determines to end the generation of the musical piece according to the EOS.
The maximum number of tokens of the musical piece may be a preset fixed number, or a number determined according to the phrase structure input by the user: the number of phrases and the number of tokens in each phrase can be determined from the phrase structure, and the total number of tokens obtained from these is the maximum number of tokens of the musical piece.
Alternatively, when the token to be determined is the CLS, it is judged according to the phrase structure of the musical piece whether the current phrase is the last phrase. If the current phrase is the last phrase of the musical piece, an EOS is generated and input into the music generation model as the current token. If the current phrase is not the last phrase, the generation of the musical piece is not ended, no EOS is generated, and the token to be determined is input into the music generation model as the current token.
In another implementation, the EOS may not be set, where the generation of the musical piece is determined to be ended when the number of determined tokens of the musical piece reaches the maximum number of tokens of the musical piece, or where the token to be determined is a CLS, and where the current phrase is determined to be the last phrase of the musical piece according to the phrase structure.
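A sketch of this ending condition without an explicit EOS (the token representation and the phrase-structure convention — one letter per phrase — are assumptions):

```python
def generation_finished(tokens, phrase_structure=None, max_tokens=None) -> bool:
    """Stop when the maximum token count is reached, or when a CLS closes
    the last phrase of the user-specified phrase structure."""
    if max_tokens is not None and len(tokens) >= max_tokens:
        return True
    if phrase_structure and tokens and tokens[-1]["family"] == "CLS":
        phrases_done = sum(1 for t in tokens if t["family"] == "CLS")
        total_phrases = sum(1 for c in phrase_structure if c.isalpha())
        return phrases_done >= total_phrases  # e.g. 6 phrases in 'i4A4B4A4B4o4'
    return False
```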
The above embodiments take the case where each token includes a plurality of music attributes as an example. In one implementation, each token may instead include only one music attribute, which is equivalent to tiling the music attributes of a token into a long sequence; accordingly, when converting a token into a token vector, the music representation model does not need to fuse multiple attributes, i.e. the MLP operation in formula (2) is not needed.
On the basis of the first embodiment and the second embodiment, a third embodiment of the present application provides a training method for a music generating model, in which the training device trains to obtain the music generating model in the above embodiment, and sends the music generating model obtained by training to a terminal.
Fig. 7 is a flowchart of a training method of a music generation model according to a third embodiment of the present application, and as shown in fig. 7, the method provided in this embodiment includes the following steps.
S201, a plurality of training samples are obtained; each training sample is the actual symbol sequence of a musical piece, the musical piece is composed of a plurality of phrases, the actual symbol sequence comprises the actual tokens of each phrase of the musical piece, and each actual token comprises a plurality of music attributes.
A large number of training samples are acquired for training a music generation model, and each actual token of each phrase of a musical composition is also called a label of the training sample, wherein the actual token can be obtained through manual labeling or machine labeling, and each actual token comprises a plurality of musical attributes.
Illustratively, the token includes the following musical attributes: token category, phrase identification, remaining bars of the phrase, beat identification within bars, chord, track to which the note belongs, pitch of the note, and duration of the note. Illustratively, the token categories include: BOS, EOS, BAR, POS, NOTE and CLS.
S202, training the music generation model by using the training sample to obtain a predicted symbol sequence of the training sample, wherein the predicted symbol sequence comprises predicted tokens of the musical compositions sequentially determined by the music generation model.
Each actual token of the training sample is input into the music generation model for training, to obtain the predicted tokens of the musical piece; each actual token corresponds to one predicted token. In the training process, when the (j+1)-th token is predicted, the j-th actual token of the musical piece and the position information of the (j+1)-th token are taken as the input of the music generation model to determine the (j+1)-th token, thereby obtaining the (j+1)-th predicted token, and the attention information of the (j+1)-th token is calculated by the structured attention method. When calculating the attention information of the (j+1)-th token with the structured attention method, the actual tokens preceding the (j+1)-th token in the musical piece are used.
Optionally, when determining each predicted token, the value of the token category is determined first, and then the values of the other music attributes are determined.
The phrase identities of repeated phrases in the training sample are the same, so that when determining each predicted token, the music generation model can distinguish which phrases are strongly related and which are weakly related, and thus calculate the attention information of the token by the structured attention method.
S203, calculating a loss value of the training sample according to the actual symbol sequence and the predicted symbol sequence of the training sample.
The actual symbol sequence and the predicted symbol sequence of a training sample have the same length, i.e. each actual token corresponds to one predicted token. When calculating the loss value of the training sample, the actual token and the predicted token at the same position are used to calculate a per-position loss, and the losses of all positions are added to obtain the loss value of the training sample. The loss of the training sample includes, but is not limited to, a cross-entropy loss; other loss functions may also be used.
S204, calculating the loss value of the music generation model by using the loss values of the training samples.
The loss values of the plurality of training samples may be summed to obtain the loss value of the music generation model, or they may be weighted and summed to obtain the loss value of the music generation model.
S205, updating parameters of the music generation model according to the loss value of the music generation model.
After the loss value of the music generation model is obtained through calculation, the parameters of the music generation model are updated according to the loss value of the music generation model, the music generation model is continuously optimized, and the music generation model meeting the user requirements is obtained through multiple rounds of training.
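A sketch of one training step following S202-S205, assuming the model returns per-position, per-attribute logits aligned with the actual tokens (all names are illustrative):

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, batch):
    """One parameter update: teacher-forced prediction, cross-entropy summed
    over attributes and positions per sample, then summed over samples."""
    optimizer.zero_grad()
    model_loss = torch.tensor(0.0)
    for actual_tokens in batch:                # one actual symbol sequence
        logits = model(actual_tokens)          # aligned predicted logits
        for pos, actual in enumerate(actual_tokens):
            for attr, value in actual.items():  # same-position actual vs. predicted
                model_loss = model_loss + F.cross_entropy(
                    logits[pos][attr].unsqueeze(0), torch.tensor([value]))
    model_loss.backward()                      # update by the model's loss value
    optimizer.step()
    return float(model_loss)
```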
In this embodiment, the specific implementation manner of determining each token by the music generation model refers to the descriptions of the first embodiment and the second embodiment, and will not be repeated here.
In order to facilitate better implementation of the method for generating the musical composition according to the embodiment of the application, the embodiment of the application also provides a device for generating the musical composition. Fig. 8 is a schematic structural diagram of a musical piece generating device according to a fourth embodiment of the present application, where the musical piece generating device 100 may include:
a starting module 11 for generating starting symbol information of the musical composition according to the generating instruction of the musical composition;
a determining module 12, configured to input the start symbol information into a music generating model, and sequentially determine a plurality of symbol information of the musical piece, where each symbol information includes a plurality of music attributes, where the music generating model is configured to calculate attention information of the symbol information to be determined according to position information of the symbol information to be determined in the musical piece, and determined symbol information and determined phrases that precede the symbol information to be determined, using a structured attention method, and determine the symbol information to be determined according to the attention information of the symbol information to be determined, where the structured attention method includes: determining attention information of the symbol information to be determined according to vectors of all non-phrase ending symbol information of a strong correlation phrase of the current phrase to which the symbol information to be determined belongs, vectors of phrase ending symbol information of a weak correlation phrase of the current phrase and vectors of determined symbol information of the current phrase, wherein the strong correlation phrase and the weak correlation phrase of the current phrase belong to the determined phrase, and the phrase ending symbol information is last symbol information of each phrase and is used for representing music information of the whole phrase;
A generating module 13, configured to generate a score of the musical piece according to the symbol information of the musical piece.
In some implementations, the music generation model includes: a music representation model, an autoregressive decoding network and an output module;
the music representation model is used for converting current symbol information which is newly generated by the music generation model into a vector of the current symbol information;
the autoregressive decoding network is used for taking the vector of the current symbol information and the position code of the symbol information to be determined as inputs, calculating the attention information of the symbol information to be determined by adopting the structured attention method, determining the decoding vector of the symbol information to be determined according to the attention information of the symbol information to be determined, wherein the symbol information to be determined is the next symbol information of the current symbol information, and the position code is obtained by encoding the position information of the symbol information to be determined in the musical piece;
when the symbol information to be determined is the non-last symbol information of the current phrase, determining attention information of the symbol information to be determined according to vectors of all non-phrase ending symbol information of the strong related phrase, vectors of phrase ending symbol information of the weak related phrase and vectors of the symbol information determined in the current phrase; when the symbol information to be determined is the last symbol information of the current phrase, determining the attention information of the symbol information to be determined according to the vector of the symbol information of the current phrase;
The output module is used for determining the symbol information to be determined according to the decoding vector of the symbol information to be determined, and taking the symbol information to be determined as the input of the music representation model.
In some implementations, the symbol information includes the following musical attributes: the symbol information category, phrase identification, remaining bar numbers of phrases, beat identification in bars, chords, tracks to which notes belong, pitch of notes and duration of notes;
the symbol information category includes any one of the following categories: a music beginning, a music end, a bar beginning, a chord or note starting position, and a phrase end, wherein the phrase end category is the category of the last symbol information of a phrase.
In some implementations, the music representation model is specifically for:
converting the value of each music attribute of the current symbol information into a vector corresponding to each music attribute;
combining vectors corresponding to all the music attributes of the current symbol information to form a first vector;
and mapping the first vector to a linear layer with a preset dimension to obtain the vector of the current symbol information.
In some implementations, the music representation model is specifically for:
The value of each music attribute of the current symbol information is processed through word embedding, so that a vector corresponding to each music attribute is obtained;
the mapping the first vector to a linear layer with a preset dimension to obtain a vector of the current symbol information includes:
the first vector is converted into a vector of the current symbol information by a multi-layer perceptron MLP.
In some implementations, further comprising:
the receiving module is used for receiving the generation parameters of the musical piece input by a user, wherein the generation parameters of the musical piece comprise at least one of the following parameters: the phrase structure of the musical piece and the chord progression information of the musical piece, the chord progression information comprising the positions and names of a plurality of chords;
the output module is specifically configured to: determining a predicted value of the symbol information to be determined according to the decoding vector of the symbol information to be determined; and modifying the predicted value of the symbol information to be determined according to the generation parameter to obtain the symbol information to be determined.
In some implementations, when the phrase structure of the musical piece is included in the generation parameters, the output module is specifically configured to:
Determining that the symbol information to be determined is the first symbol information in a new phrase, and establishing the new phrase;
determining the identity of the new phrase and the number of bars of the phrase according to the phrase structure;
and modifying the phrase identity of the symbol information to be determined and the predicted value of the remaining number of bars of the phrase according to the identity of the new phrase and the number of bars of the phrase.
In some implementations, when the generation parameter further includes chord progression information of the musical piece, the output module is specifically configured to:
determining that the symbol information to be determined is the initial position of the chord according to the position of each chord in the chord progression information and the position of the symbol information to be determined in the current phrase;
and modifying the predicted value of the chord or the starting position of the note in the predicted value of the symbol information to be determined according to the name of the chord in the chord progression information.
In some implementations, the output module includes a linear layer and a softmax layer;
the linear layer is used for mapping the decoding vector of the symbol information to be determined into output vectors of all music attributes of the symbol information to be determined;
The softmax layer is used for processing the output vector of each music attribute of the symbol information to be determined to obtain probability distribution of each music attribute of the symbol information to be determined, and determining predicted values of each music attribute of the symbol information to be determined according to the probability distribution of each music attribute of the symbol information to be determined, wherein the predicted values of each music attribute of the symbol information to be determined form predicted values of the symbol information to be determined.
In some implementations, the softmax layer is specifically for:
processing the output vector of the symbol information category of the symbol information to be determined to obtain probability distribution of the symbol information category of the symbol information to be determined, and determining a predicted value of the symbol information category of the symbol information to be determined according to the probability distribution of the symbol information category of the symbol information to be determined;
determining a first type of music attribute with a null value and a second type of music attribute with a non-null value in the remaining music attributes of the symbol information to be determined according to the predicted value of the symbol information category of the symbol information to be determined, wherein the first type of music attribute and the second type of music attribute included in different symbol information categories are different;
And processing the output vector of the second type of music attribute of the symbol information to be determined to obtain probability distribution of the second type of music attribute, and determining a predicted value of the second type of music attribute according to the probability distribution of the second type of music attribute.
In some implementations, the music generation model is further to:
judging whether the current symbol information is music ending symbol information or not;
if the current symbol information is the music ending symbol information, determining to end the generation of the music work;
and if the current symbol information is not the music ending symbol information, taking the current symbol information as the input of the music representation model.
In some implementations, further comprising:
and the ending module is used for generating the music ending symbol information according to the phrase structure of the music work or the maximum symbol information quantity of the music work, and taking the music ending symbol information as the current symbol information.
In some implementations, the autoregressive decoding network includes L decoders, L being greater than or equal to 1;
the input of the first decoder is the vector of the current symbol information and the position code of the symbol information to be determined;
The input of the remaining decoders is the output of the adjacent previous decoder, and the output of the last decoder is the decoding vector of the symbol information to be determined;
and each decoder calculates the attention information of the symbol information to be determined by adopting the structured attention method.
In some implementations, the decoder includes a structured attention module and a feed forward network;
the structured attention module is used for calculating the attention information of the symbol information to be determined by the structured attention method;
and the feedforward network is used for processing the attention information of the symbol information to be determined to obtain a decoding vector of the symbol information to be determined.
In some implementations, the autoregressive decoding network is a recurrent neural network RNN model, a convolutional neural network CNN model, or a Transformer model.
In order to facilitate better implementation of the training method of the music generation model in the embodiment of the application, the embodiment of the application also provides a training device of the music generation model. Fig. 9 is a schematic structural diagram of a training device for a music generating model according to a fifth embodiment of the present application, where the training device 200 for a music generating model may include:
An obtaining module 21, configured to obtain a plurality of training samples, where the training samples are an actual symbol sequence of a musical piece, and the musical piece is composed of a plurality of phrases, and the actual symbol sequence includes actual symbol information of each phrase of the musical piece, and the actual symbol information includes a plurality of musical attributes;
the training module 22 is configured to train the music generation model by using the training sample, so as to obtain a predicted symbol sequence of the training sample, where the predicted symbol sequence includes predicted symbol information of the musical piece sequentially determined by the music generation model;
a loss calculation module 23, configured to calculate a loss value of the training sample according to the actual symbol sequence and the predicted symbol sequence of the training sample;
the loss calculation module 23 is further configured to calculate a loss value of the music generation model using the loss values of the plurality of training samples;
and the parameter updating module 24 is used for updating the parameters of the music generation model according to the loss value of the music generation model.
It should be understood that apparatus embodiments and method embodiments may correspond with each other and that similar descriptions may refer to the method embodiments. To avoid repetition, no further description is provided here.
The music piece generating apparatus 100 and the training apparatus 200 of the music generation model according to the embodiment of the present application are described above from the viewpoint of functional blocks with reference to the drawings. It should be understood that the functional module may be implemented in hardware, or may be implemented by instructions in software, or may be implemented by a combination of hardware and software modules. Specifically, each step of the method embodiment in the embodiment of the present application may be implemented by an integrated logic circuit of hardware in a processor and/or an instruction in a software form, and the steps of the method disclosed in connection with the embodiment of the present application may be directly implemented as a hardware decoding processor or implemented by a combination of hardware and software modules in the decoding processor. Alternatively, the software modules may be located in a well-established storage medium in the art such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, and the like. The storage medium is located in a memory, and the processor reads information in the memory, and in combination with hardware, performs the steps in the above method embodiments.
As shown in fig. 10, fig. 10 is a schematic structural diagram of a terminal according to a sixth embodiment of the present application, where the terminal 300 includes a processor 31 with one or more processing cores, a memory 32 with one or more computer readable storage media, a computer program stored in the memory 32 and executable on the processor, and a display 33. The processor 31 is electrically connected to the memory 32 and the display 33. It will be appreciated by those skilled in the art that the computer device structure shown in the figures is not limiting of the computer device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The processor 31 is a control center of the terminal 300, connects various parts of the entire terminal 300 using various interfaces and lines, and performs various functions of the terminal 300 and processes data by running or loading software programs and/or modules stored in the memory 32 and calling data stored in the memory 32 to implement the embodiments of the present application and the methods in the embodiments.
The display 33 may be used to display a graphical user interface and receive operation instructions generated by a user acting on the graphical user interface. The display screen 33 may be a touch display screen, which may include a display panel and a touch panel. The display panel may be used to display information entered by or provided to a user, as well as various graphical user interfaces of the computer device, which may be composed of graphics, text, icons, video, and any combination thereof. Alternatively, the display panel may be configured in the form of a liquid crystal display (LCD, Liquid Crystal Display), an organic light-emitting diode (OLED, Organic Light-Emitting Diode), or the like. The touch panel may be used to collect touch operations performed on or near it by the user (such as operations performed on or near the touch panel with a finger, a stylus, or any other suitable object or accessory) and generate corresponding operation instructions, which execute corresponding programs. Alternatively, the touch panel may include two parts: a touch detection device and a touch controller. The touch detection device detects the position of the user's touch, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, sends the coordinates to the processor 31, and can receive and execute commands sent by the processor 31. The touch panel may overlay the display panel; when the touch panel detects a touch operation on or near it, it transmits the operation to the processor 31 to determine the type of touch event, and the processor 31 then provides a corresponding visual output on the display panel according to the type of touch event. In the embodiment of the present application, the touch panel and the display panel may be integrated into the display screen 33 to realize the input and output functions. In some embodiments, however, the touch panel and the display panel may be implemented as two separate components to perform the input and output functions. That is, the display 33 may also implement an input function as part of the input unit 36.
Optionally, as shown in fig. 10, the terminal 300 further includes: radio frequency circuitry 34, audio circuitry 35, input unit 36, and power supply 37. The processor 31 is electrically connected to the rf circuit 34, the audio circuit 35, the input unit 36 and the power source 37, respectively. It will be appreciated by those skilled in the art that the terminal structure shown in fig. 10 is not limiting of the terminal and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
The radio frequency circuitry 34 may be used to transceive radio frequency signals to establish wireless communication with a network device or other computer device via wireless communication.
The audio circuit 35 may be used to provide an audio interface between the user and the computer device through a speaker and a microphone. On one hand, the audio circuit 35 may convert received audio data into an electrical signal and transmit it to the speaker, which converts it into a sound signal for output; on the other hand, the microphone converts collected sound signals into electrical signals, which are received by the audio circuit 35 and converted into audio data; the audio data is processed by the processor 31 and then sent, for example, to another computer device via the radio frequency circuit 34, or output to the memory 32 for further processing. The audio circuit 35 may also include an earphone jack to provide communication between peripheral earphones and the computer device.
The input unit 36 may be used to receive input numbers, character information or object feature information (e.g., fingerprint, iris, facial information, etc.), and to generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control.
The power supply 37 is used to power the various components of the terminal 300. Alternatively, the power supply 37 may be logically connected to the processor 31 through a power management system, so that functions of charge, discharge, and power consumption management are performed through the power management system. The power supply 37 may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown in fig. 10, the terminal 300 may further include a camera, a sensor, a wireless fidelity module, a bluetooth module, etc., which will not be described herein.
The embodiment of the application also provides training equipment, which comprises a processor and a memory, wherein the memory is used for storing a computer program, and the processor is used for calling and running the computer program stored in the memory so as to execute the method steps shown in the third embodiment. The training device is similar in construction to the terminal structure, and reference may be made to the terminal structure shown in fig. 10, it being understood that the training device may include more or fewer components than the terminal.
The present application also provides a computer storage medium having stored thereon a computer program which, when executed by a computer, enables the computer to perform the method of the above-described method embodiments, the computer being either a terminal or a training device.
The present application also provides a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the electronic device reads the computer program from the computer readable storage medium, and the processor executes the computer program, so that the electronic device executes the corresponding flow in the above method embodiment, which is not described herein for brevity. The computer may be a terminal or a training device.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules illustrated as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. For example, functional modules in various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
The above embodiments are merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (19)

1. A method of generating a musical composition, comprising:
generating initial symbol information of the musical piece according to a generating instruction of the musical piece;
Inputting the initial symbol information into a music generation model, and sequentially determining a plurality of pieces of symbol information of the musical piece, wherein each piece of symbol information comprises a plurality of music attributes; the music generation model is used for calculating attention information of the symbol information to be determined by a structured attention method, according to position information of the symbol information to be determined in the musical piece and the determined symbol information and determined phrases located before the symbol information to be determined, and for determining the symbol information to be determined according to the attention information of the symbol information to be determined, wherein the structured attention method comprises: determining the attention information of the symbol information to be determined according to vectors of all non-phrase-ending symbol information of a strongly related phrase of the current phrase to which the symbol information to be determined belongs, vectors of phrase-ending symbol information of weakly related phrases of the current phrase, and vectors of the determined symbol information of the current phrase, wherein the strongly related phrase and the weakly related phrases of the current phrase belong to the determined phrases, and the phrase-ending symbol information is the last symbol information of each phrase and is used for representing the music information of the whole phrase;
And generating a score of the musical composition according to the symbol information of the musical composition.
2. The method of claim 1, wherein the music generation model comprises: a music representation model, an autoregressive decoding network and an output module;
the music representation model is used for converting current symbol information which is newly generated by the music generation model into a vector of the current symbol information;
the autoregressive decoding network is used for taking the vector of the current symbol information and the position code of the symbol information to be determined as inputs, calculating the attention information of the symbol information to be determined by adopting the structured attention method, determining the decoding vector of the symbol information to be determined according to the attention information of the symbol information to be determined, wherein the symbol information to be determined is the next symbol information of the current symbol information, and the position code is obtained by encoding the position information of the symbol information to be determined in the musical piece;
when the symbol information to be determined is the non-last symbol information of the current phrase, determining attention information of the symbol information to be determined according to vectors of all non-phrase ending symbol information of the strong related phrase, vectors of phrase ending symbol information of the weak related phrase and vectors of the symbol information determined in the current phrase; when the symbol information to be determined is the last symbol information of the current phrase, determining the attention information of the symbol information to be determined according to the vector of the symbol information of the current phrase;
The output module is used for determining the symbol information to be determined according to the decoding vector of the symbol information to be determined, and taking the symbol information to be determined as the input of the music representation model.
3. The method of claim 2, wherein the symbol information includes the following musical attributes: the symbol information category, phrase identification, remaining bar numbers of phrases, beat identification in bars, chords, tracks to which notes belong, pitch of notes and duration of notes;
the symbol information category includes any one of the following categories: a music beginning, a music end, a bar beginning, a chord or note starting position, and a phrase end, wherein the phrase end category is the category of the last symbol information of a phrase.
4. A method according to claim 2 or 3, wherein said converting current symbol information newly generated by said music generation model into a vector of said current symbol information comprises:
converting the value of each music attribute included in the current symbol information into a vector corresponding to each music attribute;
combining vectors corresponding to all music attributes of the current symbol information to form a first vector;
And mapping the first vector to a linear layer with a preset dimension to obtain the vector of the current symbol information.
5. The method of claim 4, wherein converting the value of each musical attribute included in the current symbol information into a vector corresponding to each musical attribute, comprises:
the value of each music attribute of the current symbol information is processed through word embedding, so that a vector corresponding to each music attribute is obtained;
the mapping the first vector to a linear layer with a preset dimension to obtain a vector of the current symbol information includes:
the first vector is converted into a vector of the current symbol information by a multi-layer perceptron MLP.
6. The method of claim 2, wherein prior to generating the starting symbol information for the musical piece in accordance with the musical piece generation instruction, further comprising:
receiving the generation parameters of the musical piece input by a user, wherein the generation parameters of the musical piece comprise at least one of the following parameters: the phrase structure of the musical piece and the chord progression information of the musical piece, the chord progression information comprising the positions and names of a plurality of chords;
The determining the symbol information to be determined according to the decoding vector of the symbol information to be determined, taking the symbol information to be determined as the input of the music representation model, includes:
determining a predicted value of the symbol information to be determined according to the decoding vector of the symbol information to be determined;
and modifying the predicted value of the symbol information to be determined according to the generation parameter to obtain the symbol information to be determined.
7. The method of claim 6, wherein, when the generation parameters include the phrase structure of the musical piece, the modifying of the predicted value of the symbol information to be determined according to the generation parameters to obtain the symbol information to be determined comprises:
when it is determined that the symbol information to be determined is the first symbol information of a new phrase, establishing the new phrase;
determining the identification of the new phrase and the number of bars of the phrase according to the phrase structure;
and modifying the predicted values of the phrase identification and of the number of bars remaining in the phrase of the symbol information to be determined, according to the identification of the new phrase and the number of bars of the phrase.
8. The method of claim 7, wherein, when the generation parameters further include the chord progression information of the musical piece, the modifying of the predicted value of the symbol information to be determined according to the generation parameters to obtain the symbol information to be determined further comprises:
determining that the symbol information to be determined is the starting position of a chord, according to the position of each chord in the chord progression information and the position of the symbol information to be determined in the current phrase;
and modifying the predicted value of the chord attribute of the symbol information to be determined according to the name of that chord in the chord progression information.
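Claims 6-8 amount to constrained decoding: the model proposes a token and the user's generation parameters overwrite the affected attribute predictions. The sketch below assumes dict-shaped tokens and position-keyed lookup tables for the phrase structure and chord progression; both representations are this editor's invention, not the patent's.

```python
def apply_generation_parameters(pred, pos, phrase_structure, chord_progression):
    """pred: predicted attribute values of the symbol information to be determined."""
    # Claim 7: a new phrase starts here, so force the phrase identification
    # and the number of bars remaining to match the user's phrase structure.
    if phrase_structure and pos in phrase_structure:
        phrase_id, n_bars = phrase_structure[pos]
        pred["phrase_id"] = phrase_id
        pred["bars_remaining"] = n_bars
    # Claim 8: the chord progression places a chord at this position, so force
    # the token to be a chord start carrying the user's chord name.
    if chord_progression and pos in chord_progression:
        pred["category"] = "CHORD_OR_NOTE_START"
        pred["chord"] = chord_progression[pos]
    return pred

pred = apply_generation_parameters(
    {"phrase_id": 3, "bars_remaining": 2, "chord": "Am"}, pos=16,
    phrase_structure={16: (1, 8)},   # phrase 1 begins here and spans 8 bars
    chord_progression={16: "F"})     # chord F begins at this position
print(pred)  # phrase_id=1, bars_remaining=8, chord='F', category forced
```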
9. The method according to any one of claims 6-8, wherein the output module comprises a linear layer and a softmax layer, and wherein the determining of the predicted value of the symbol information to be determined based on the decoding vector of the symbol information to be determined comprises:
mapping, through the linear layer, the decoding vector of the symbol information to be determined into an output vector for each musical attribute of the symbol information to be determined;
and processing the output vector of each musical attribute through the softmax layer to obtain a probability distribution over each musical attribute, and determining the predicted value of each musical attribute according to that probability distribution, the predicted values of the musical attributes together forming the predicted value of the symbol information to be determined.
10. The method according to claim 9, wherein the processing of the output vectors through the softmax layer to obtain the probability distributions, and the determining of the predicted values according to those distributions, comprises:
processing the output vector of the symbol information category through the softmax layer to obtain a probability distribution over the symbol information category, and determining the predicted value of the symbol information category according to that distribution;
determining, according to the predicted value of the symbol information category, which of the remaining musical attributes are first-type attributes whose value is null and which are second-type attributes whose value is non-null, wherein different symbol information categories have different first-type and second-type attributes;
and processing the output vectors of the second-type attributes through the softmax layer to obtain probability distributions over the second-type attributes, and determining the predicted values of the second-type attributes according to those distributions.
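Claims 9-10 describe a two-stage output head: sample the symbol category first, then sample only the attributes that are non-null for that category. The sketch below (PyTorch assumed) invents the null/non-null table and all vocabulary sizes for illustration; the patent states only that these sets differ between categories.

```python
import torch
import torch.nn as nn

ATTRS = ["category", "phrase_id", "bars_remaining", "beat",
         "chord", "track", "pitch", "duration"]
VOCAB = {"category": 5, "phrase_id": 32, "bars_remaining": 16, "beat": 16,
         "chord": 64, "track": 4, "pitch": 128, "duration": 64}
# Hypothetical second-type (non-null) attributes per category id.
NON_NULL = {0: [], 1: [], 2: ["phrase_id", "bars_remaining", "beat"],
            3: ["phrase_id", "bars_remaining", "beat", "chord",
                "track", "pitch", "duration"],
            4: ["phrase_id"]}

class OutputModule(nn.Module):
    def __init__(self, model_dim=512):
        super().__init__()
        # Claim 9: a linear layer yields one output vector per attribute.
        self.heads = nn.ModuleDict({a: nn.Linear(model_dim, VOCAB[a]) for a in ATTRS})

    def forward(self, h):  # h: (model_dim,) decoding vector
        # Claim 10, step 1: softmax over the category head, sample a category.
        cat_probs = torch.softmax(self.heads["category"](h), dim=-1)
        category = int(torch.multinomial(cat_probs, 1))
        token = {"category": category}
        # Step 2: decode only the second-type attributes of that category;
        # first-type attributes stay null (absent from the token).
        for attr in NON_NULL[category]:
            probs = torch.softmax(self.heads[attr](h), dim=-1)
            token[attr] = int(torch.multinomial(probs, 1))
        return token

print(OutputModule()(torch.randn(512)))
```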
11. The method of claim 2, wherein the music generation model is further configured to:
judge whether the current symbol information is the music-ending symbol information;
if the current symbol information is the music-ending symbol information, determine that the generation of the musical piece is ended;
and if the current symbol information is not the music-ending symbol information, take the current symbol information as the input of the music representation model.
12. The method as recited in claim 11, further comprising:
generating the music-ending symbol information according to the phrase structure of the musical piece or the maximum number of symbol information of the musical piece, and taking the music-ending symbol information as the current symbol information.
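Claims 11-12 together define the stopping rule of the generation loop. A small sketch, with a hypothetical `model.step` interface and dict-shaped tokens:

```python
def generate(model, start_token, max_tokens=2048):
    tokens = [start_token]
    while True:
        current = tokens[-1]
        # Claim 11: stop once the music-ending symbol information appears.
        if current["category"] == "MUSIC_END":
            break
        # Claim 12: force the ending symbol at the maximum symbol count.
        if len(tokens) >= max_tokens:
            tokens.append({"category": "MUSIC_END"})
            break
        # Otherwise the current symbol is fed back as the model's next input.
        tokens.append(model.step(current))
    return tokens

class StubModel:  # stand-in that ends the piece after three steps
    def __init__(self): self.n = 0
    def step(self, tok):
        self.n += 1
        return {"category": "MUSIC_END" if self.n >= 3 else "BAR_START"}

print(generate(StubModel(), {"category": "MUSIC_START"}))
```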
13. The method of claim 2, wherein the autoregressive decoding network comprises L decoders, L being greater than or equal to 1;
the input of the first decoder is the vector of the current symbol information and the position code of the symbol information to be determined;
the input of each remaining decoder is the output of the immediately preceding decoder, and the output of the last decoder is the decoding vector of the symbol information to be determined;
and each decoder calculates the attention information of the symbol information to be determined using the structured attention method.
14. The method of claim 13, wherein each decoder comprises a structured attention module and a feed-forward network;
the structured attention module is used for calculating the attention information of the symbol information to be determined by the structured attention method;
and the feed-forward network is used for processing the attention information of the symbol information to be determined to obtain the decoding vector of the symbol information to be determined.
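Claims 13-14 describe a standard stacked-decoder layout. In the sketch below (PyTorch assumed), the structured attention module is approximated by ordinary multi-head attention plus the phrase-based mask from claim 2; layer norms and dimensions are conventional choices, not taken from the patent.

```python
import torch
import torch.nn as nn

class DecoderLayer(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(),
                                 nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, query, memory, mask):
        # Structured attention: the mask admits only the tokens claim 2 allows.
        # PyTorch semantics: attn_mask=True marks positions to exclude.
        a, _ = self.attn(query, memory, memory, attn_mask=mask)
        x = self.norm1(query + a)
        return self.norm2(x + self.ffn(x))  # claim 14: attention, then FFN

class Decoder(nn.Module):
    def __init__(self, L=6, dim=512):
        super().__init__()
        self.layers = nn.ModuleList(DecoderLayer(dim) for _ in range(L))

    def forward(self, query, memory, mask):
        # Claim 13: the first decoder takes the current-symbol vector (with the
        # position code added); each remaining decoder takes the previous output.
        for layer in self.layers:
            query = layer(query, memory, mask)
        return query  # decoding vector of the symbol information to be determined

dec = Decoder(L=2)
q, mem = torch.randn(1, 1, 512), torch.randn(1, 6, 512)
allowed = torch.tensor([[False, False, True, True, True, False]])
print(dec(q, mem, mask=~allowed).shape)  # torch.Size([1, 1, 512])
```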
15. A method of training a music generation model, the music generation model being the model used in the method of any one of claims 1 to 14, the training method comprising:
acquiring a plurality of training samples, wherein each training sample is an actual symbol sequence of a musical piece, the musical piece is composed of a plurality of phrases, the actual symbol sequence comprises the actual symbol information of each phrase of the musical piece, and the actual symbol information comprises a plurality of musical attributes;
training the music generation model with the training samples to obtain a predicted symbol sequence for each training sample, wherein the predicted symbol sequence comprises the predicted symbol information of the musical piece determined in sequence by the music generation model;
calculating a loss value of each training sample according to the actual symbol sequence and the predicted symbol sequence of that training sample;
calculating a loss value of the music generation model using the loss values of the plurality of training samples;
and updating parameters of the music generation model according to the loss value of the music generation model.
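A compact sketch of the training step in claim 15 (PyTorch assumed): teacher-forced prediction of each next symbol, a per-attribute cross-entropy between the actual and predicted sequences, and a parameter update from the summed loss. The dict-of-tensors batch layout and the model interface are assumptions.

```python
import torch
import torch.nn as nn

def train_step(model, optimizer, batch, attr_names):
    """batch: dict attr -> (B, T) long tensors, the actual symbol sequences."""
    optimizer.zero_grad()
    logits = model(batch)               # assumed: dict attr -> (B, T, vocab)
    ce = nn.CrossEntropyLoss()
    loss = 0.0
    for attr in attr_names:
        pred = logits[attr][:, :-1]     # prediction for each next symbol
        target = batch[attr][:, 1:]     # the actual next symbol in the sample
        loss = loss + ce(pred.flatten(0, 1), target.flatten())
    loss.backward()                     # update parameters from the model loss
    optimizer.step()
    return float(loss)
```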
16. A musical piece generating apparatus, comprising:
a starting module, configured to generate the starting symbol information of the musical piece according to a generation instruction for the musical piece;
a determining module, configured to input the starting symbol information into a music generation model and sequentially determine a plurality of symbol information of the musical piece, each symbol information comprising a plurality of musical attributes, wherein the music generation model is configured to calculate, by adopting a structured attention method, the attention information of the symbol information to be determined according to the position information of the symbol information to be determined in the musical piece and according to the determined symbol information and determined phrases located before the symbol information to be determined, and to calculate the symbol information to be determined according to that attention information, the structured attention method comprising: determining the attention information of the symbol information to be determined according to the vectors of all non-phrase-ending symbol information of the strongly correlated phrases of the current phrase to which the symbol information to be determined belongs, the vectors of the phrase-ending symbol information of the weakly correlated phrases of the current phrase, and the vectors of the determined symbol information of the current phrase, wherein the strongly and weakly correlated phrases of the current phrase belong to the determined phrases, and the phrase-ending symbol information is the last symbol information of each phrase and represents the musical information of the phrase as a whole;
and a generation module, configured to generate the music score of the musical piece according to the symbol information of the musical piece.
17. A training apparatus for a music generation model, the music generation model being the model used in any one of claims 1 to 14, the apparatus comprising:
an acquisition module, configured to acquire a plurality of training samples, wherein each training sample is an actual symbol sequence of a musical piece, the musical piece is composed of a plurality of phrases, the actual symbol sequence comprises the actual symbol information of each phrase of the musical piece, and the actual symbol information comprises a plurality of musical attributes;
a training module, configured to train the music generation model with the training samples to obtain a predicted symbol sequence for each training sample, wherein the predicted symbol sequence comprises the predicted symbol information of the musical piece determined in sequence by the music generation model;
a loss calculation module, configured to calculate a loss value of each training sample according to the actual symbol sequence and the predicted symbol sequence of that training sample;
the loss calculation module being further configured to calculate the loss value of the music generation model using the loss values of the plurality of training samples;
and a parameter updating module, configured to update the parameters of the music generation model according to the loss value of the music generation model.
18. An electronic device, comprising:
a processor, and a memory for storing a computer program, the processor being configured to invoke and run the computer program stored in the memory so as to perform the method of any one of claims 1 to 14 or claim 15.
19. A computer-readable storage medium storing a computer program which causes a computer to perform the method of any one of claims 1 to 14 or claim 15.
CN202310970525.7A 2023-08-03 2023-08-03 Musical composition generation method, music generation model training method and equipment thereof Active CN116704980B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310970525.7A CN116704980B (en) 2023-08-03 2023-08-03 Musical composition generation method, music generation model training method and equipment thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310970525.7A CN116704980B (en) 2023-08-03 2023-08-03 Musical composition generation method, music generation model training method and equipment thereof

Publications (2)

Publication Number Publication Date
CN116704980A true CN116704980A (en) 2023-09-05
CN116704980B CN116704980B (en) 2023-10-20

Family

ID=87843583

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310970525.7A Active CN116704980B (en) 2023-08-03 2023-08-03 Musical composition generation method, music generation model training method and equipment thereof

Country Status (1)

Country Link
CN (1) CN116704980B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA2725722A1 (en) * 2010-12-17 2012-06-17 Karen E. Collins Method and system for incorporating music into a video game using game parameters
WO2020006452A1 (en) * 2018-06-29 2020-01-02 Godunov Vladimir Music composition aid
CN113516961A (en) * 2021-09-15 2021-10-19 腾讯科技(深圳)有限公司 Note generation method, related device, storage medium and program product
CN116052621A (en) * 2023-01-17 2023-05-02 哈尔滨工业大学 Music creation auxiliary method based on language model
CN116386575A (en) * 2023-03-01 2023-07-04 网易(杭州)网络有限公司 Music generation method, device, electronic equipment and storage medium


Also Published As

Publication number Publication date
CN116704980B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN110782870B (en) Speech synthesis method, device, electronic equipment and storage medium
Friberg pDM: an expressive sequencer with real-time control of the KTH music-performance rules
Iazzetta Meaning in musical gesture
CN110427490A (en) A kind of emotion dialogue generation method and device based on from attention mechanism
CN109448683A (en) Music generating method and device neural network based
CN109716326A (en) Personalized song is provided in automatic chatting
CN114492407B (en) News comment generation method, system, equipment and storage medium
CN112164379A (en) Audio file generation method, device, equipment and computer readable storage medium
Canazza et al. Caro 2.0: an interactive system for expressive music rendering
CN110851650B (en) Comment output method and device and computer storage medium
CN112035699A (en) Music synthesis method, device, equipment and computer readable medium
Martin et al. Understanding musical predictions with an embodied interface for musical machine learning
CN110968289A (en) Audio playing method and device and computer storage medium
Cádiz Creating music with fuzzy logic
Bretan et al. Chronicles of a Robotic Musical Companion.
CN116704980B (en) Musical composition generation method, music generation model training method and equipment thereof
Martin et al. Deep predictive models in interactive music
CN112420002A (en) Music generation method, device, electronic equipment and computer readable storage medium
CN113889130A (en) Voice conversion method, device, equipment and medium
CN113555027A (en) Voice emotion conversion method and device, computer equipment and storage medium
CN113920969A (en) Information processing method, information processing device, electronic equipment and storage medium
Leman Foundations of musicology as content processing science
WO2024075422A1 (en) Musical composition creation method and program
Liu et al. Research on artificial intelligence generated audio
CN116645957B (en) Music generation method, device, terminal, storage medium and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant