CN118609529A - Method and system for generating AI music works - Google Patents
- Publication number
- CN118609529A (application CN202410857320.2A)
- Authority
- CN
- China
- Prior art date: 2024-06-28
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The invention discloses a method and a system for generating AI musical compositions. The method comprises the following steps: constructing a score vocabulary that includes section separators and music theory indicators; training an autoregressive model on digital staff notation to obtain a music generation model; building a score piece based on the score vocabulary and the user's requirements, the score piece comprising a number of score measures, each score measure beginning with the measure start symbol among the section separators and ending with the measure end symbol among the section separators, with at least one voice part between the measure start symbol and the measure end symbol; constructing a part-of-speech transition matrix that constrains the sampling rules between two adjacent parts of speech; and inputting the score piece into the music generation model while updating the model's random sampling process based on the part-of-speech transition matrix, to generate the score. The grammar structure is simple, the autoregressive model is easy to train, and the generated results have good grammatical robustness.
Description
Technical Field
The invention belongs to the technical field of computers, and particularly relates to a method and a system for generating an AI musical composition.
Background
Staff notation is a classical way of writing music that expresses the core concepts of a musical work concisely. With the rapid development of artificial intelligence, music-generating artificial neural networks based on autoregressive models have advanced rapidly in recent years. Combining a neural language model with a music serialization scheme, as described in the following paragraphs, to generate valid scores that meet user requirements has become a practical approach for music creators, but the following problems remain in practice.
Oore et al. (2018) proposed a MIDI-like music language for AI generation that expresses sequences of musical performance events. As shown in FIG. 1, it encodes the performance signal of the music, which cannot be displayed directly as a score.
MusicXML, which is based on XML, is a common standard format for digital staff notation. It is suitable for rule-based parsing and generation with traditional algorithms, but its representation of music symbols is verbose, and its XML grammar contains a large amount of long-range structural context dependence, making it poorly suited to training deep learning models.
Lilypond is a special-purpose language for digital staff notation that is compact relative to MusicXML, but it still has structured context dependence, and when used directly for deep learning, some of its grammatical complexity keeps the vocabulary of a text encoder from being compact enough.
MusicGen is a music-generating deep learning model proposed by Meta AI. It is based on residual vector quantization, a method for discretizing a continuous signal that turns an audio signal into an integer sequence suitable for training an autoregressive model. However, this scheme encodes the audio signal of the music rather than its score.
Disclosure of Invention
To address the problems that the music representations of the prior art cannot be displayed directly as a score and that data conversion lowers the precision of music generation models, the invention provides a method and a system for generating AI musical compositions. Inspired by MIDI-like languages, it establishes a language-model token vocabulary with reference to Lilypond. To solve the above technical problems, the invention adopts the following technical scheme:
a method for generating AI musical compositions, comprising the steps of:
S1, constructing a score vocabulary comprising M section separators and N music theory indicators based on music theory knowledge, the section separators and the music theory indicators each being represented by distinct characters;
S2, collecting digital staff notation, and training an autoregressive model on the digital staff notation to obtain a music generation model;
S3, building a score piece based on the score vocabulary constructed in step S1 and the user's requirements, the score piece comprising a number of score measures, each score measure beginning with the measure start symbol among the section separators and ending with the measure end symbol among the section separators, with at least one voice part between the measure start symbol and the measure end symbol;
S4, constructing a part-of-speech transition matrix for constraining the sampling rules between two adjacent parts of speech, the parts of speech comprising the section separators and the music theory indicators;
S5, inputting the score piece into the music generation model while updating the random sampling process of the music generation model based on the part-of-speech transition matrix, to obtain the score corresponding to the user's requirements.
In step S3, when ζ > 1, a voice-part separator is further arranged between two adjacent voice parts, where ζ denotes the number of voice parts between the measure start symbol and the measure end symbol of the same score measure.
In step S3, token separators are arranged between the measure start symbol and the voice part, and between the measure end symbol and the voice part, of the same score measure.
In step S3, each voice part comprises a number of context items and a number of event items, each event item comprising a chord describing the pitch content, a duration describing how long the musical event lasts, and an event suffix describing the modifiers of the musical event;
the context items comprise a time signature, a clef, a track number and a key signature, each of which is one of the music theory indicators.
The chord comprises note names and octave shifts, each note name comprising a scale step and at most one accidental; the octave shift (up or down), the scale step and the accidental are each one of the music theory indicators.
The section separators comprise PAD as a filler, BOM denoting the measure start symbol, EOM denoting the measure end symbol, and VB denoting the voice-part separator;
the music theory indicators comprise the track number, the clef, the key signature, the time-signature numerator, the time-signature denominator, the scale step, the accidental, the octave up, the octave down, the basic duration, the augmentation dot, the beam, the rest, and the expressions and techniques.
The part-of-speech transition matrix is $P = (p_{i,j}) \in \{0,1\}^{(M+N)\times(M+N)}$, where $p_{i,j}$ denotes the element in row $i$ and column $j$ of the part-of-speech transition matrix $P$; $p_{i,j} = 1$ indicates that a transition from the part of speech of the row to the part of speech of the column is allowed, and $p_{i,j} = 0$ indicates that such a transition is not allowed. The parts of speech corresponding to the rows and columns of $P$ are, in order: 1 PAD; 2 BOM; 3 EOM; 4 VB; 5 track number; 6 clef; 7 key signature; 8 time-signature numerator; 9 time-signature denominator; 10 scale step; 11 accidental; 12 octave down; 13 octave up; 14 basic duration; 15 augmentation dot; 16 beam; 17 rest; 18 expressions and techniques.
A system for generating AI musical compositions, comprising:
Vocabulary construction module: for constructing a score vocabulary comprising M section separators and N music theory indicators based on music theory knowledge, the section separators and the music theory indicators each being represented by distinct characters;
Model generation module: for collecting digital staff notation and training an autoregressive model on the digital staff notation to obtain a music generation model;
Score piece establishment module: for building a score piece based on the score vocabulary of the vocabulary construction module and the user's requirements, the score piece comprising a number of score measures, each score measure beginning with the measure start symbol among the section separators and ending with the measure end symbol among the section separators, with at least one voice part between the measure start symbol and the measure end symbol;
Matrix storage module: for storing a part-of-speech transition matrix, the part-of-speech transition matrix constraining the sampling rules between two adjacent parts of speech, the parts of speech comprising the section separators and the music theory indicators;
Score generation module: for updating the random sampling process of the music generation model based on the part-of-speech transition matrix of the matrix storage module, and obtaining the score corresponding to the user's requirements from the score piece of the score piece establishment module and the updated music generation model.
The beneficial effects of the invention are as follows:
1. The application can generate AI musical compositions in staff form, spares the model the extra complexity of learning musical performance, reflects music theory rules and the essence of the music more directly, and can be converted to and from standard digital staff formats;
2. Compared with the notation of the digital staff language Lilypond, the scheme is easy to develop, can be converted to and from the Lilypond format, and is a music language scheme convenient for both reading and listening;
3. The grammar structure is simple and the score language is simply described, so the autoregressive model is easy to train; if a score dataset is created in this language, it can be used directly to train an autoregressive model, and the generated results have good grammatical robustness;
4. The score language can also be applied directly to training a recognition network model for recognizing score images, improving the training precision of the model.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings required in the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of prior-art music generation based on a MIDI-like music language;
FIG. 2 is a first schematic diagram of a score generated by the present application;
FIG. 3 is a second schematic diagram of a score generated by the present application;
FIG. 4 is a training schematic of the music generation model;
FIG. 5 is a schematic diagram of parsing and applying a score generated by the present application;
FIG. 6 is a structural analysis diagram of an exemplary score measure;
FIG. 7 is the staff notation corresponding to the score measure of FIG. 6.
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on the embodiments of the present invention without inventive effort fall within the protection scope of the present invention.
Autoregressive model: a statistical method for processing time series, in which the next element of a sequence is predicted from the preceding elements.
OMR (Optical Music Recognition): analogous to OCR, a technique for recovering a digital score from an image of staff notation.
BNF (Backus Normal Form, also Backus-Naur Form): a notation for context-free grammars that can be used to define the grammar rules of a formal language.
Token: a unit in the sequence processed by a language model. In natural language processing, a token is the smallest unit of segmentation in text preprocessing; a token may be a word, a word root, or a Chinese character.
The above description is directed to the prior art used in the present application.
A method for generating AI musical compositions, comprising the steps of:
S1, constructing a score vocabulary comprising M section separators and N music theory indicators based on music theory knowledge, the section separators and the music theory indicators each being represented by distinct characters;
In this embodiment, M = 4 and N = 14. The four section separators separate different sections during score production: PAD as a filler, BOM (begin of measure) denoting the measure start symbol, EOM (end of measure) denoting the measure end symbol, and VB (voice boundary) denoting the voice-part separator. In this embodiment, zero is used as the filler. Each section separator is a single token.
The fourteen music theory indicators are as follows:
The token corresponding to each music theory indicator can be set according to the needs of the music. In this embodiment, the tokens and their meanings are as follows. Each staff in the vertical structure of a full score is called a track, the tracks being ordered from top to bottom; the track occupied by an indicator is given by the track number (Staff), whose tokens include S1, S2 and S3 (for example, S2 denotes the second track). The clef (Clef) tokens include Cg and Cf: Cg is the treble clef (the G clef) and Cf is the bass clef (the F clef). The key signature (Key) tokens comprise K0, K1 to K6 and K_1 to K_6: K0 denotes a key signature without sharps or flats (C major or A minor), K1 to K6 denote one to six sharps, and K_1 to K_6 denote one to six flats. The time-signature numerator tokens comprise TN1 to TN12, denoting numerators of 1 to 12; the time-signature denominator tokens comprise TD2, TD4 and TD8. The scale-step tokens (i.e., the note letters) include a, b, c, d, e, f and g, corresponding to the seven note names a to g. The accidental (Accidental) tokens include As, Af, Ass and Aff, denoting in order the sharp, the flat, the double sharp and the double flat.
The token for the octave up is Osup, recording a shift of a perfect octave upward in the melody, and the token for the octave down is Osub, recording a shift of a perfect octave downward. When no octave shift (neither Osup nor Osub) occurs, the application by default follows the relative pitch mode of Lilypond and places a tone in the octave nearest to the previous pitch. The basic duration tokens comprise D1, D2, D4, D8, D16, D32 and D64, denoting the whole note, half note, quarter note, eighth note, sixteenth note, thirty-second note and sixty-fourth note respectively; the basic duration types are divided in powers of 2. The augmentation dot token is Dot; one dot increases the duration of the decorated musical event by one half, and when several dots are stacked the total duration of the musical event is (2 - 2^(-d)) times the basic duration (Division), where d is the number of dots on the current musical event. In this application, a musical event refers to a set of tones, a chord and/or a rest on the score.
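As an illustration of the dot rule above, a minimal check in Python (the function name is ours, not from the patent):

```python
def total_duration(base: float, d: int) -> float:
    """Total duration of a musical event carrying d augmentation dots:
    (2 - 2**-d) times the basic duration."""
    return base * (2 - 2 ** -d)

assert total_duration(0.25, 0) == 0.25    # plain quarter note
assert total_duration(0.25, 1) == 0.375   # dotted quarter: 1/4 + 1/8
assert total_duration(0.25, 2) == 0.4375  # double-dotted: 1/4 + 1/8 + 1/16
```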
The beam (Beam) tokens comprise Bl and Br, denoting the beam start and the beam end respectively: the beam start means the modified musical event is at the left end of the beam, and the beam end means it is at the right end of the beam. The rest tokens include Rest and RSpace: Rest means the modified musical event is a rest, while RSpace denotes an invisible (spacer) rest that only occupies duration. The expression and technique tokens include EslurL, EslurR, Etie and Earp: EslurL is a slur start symbol, meaning the modified musical event is at the left end of the slur; EslurR is a slur end symbol, meaning the modified musical event is at the right end of the slur; Etie is a tie modifier, denoting a tie on the modified musical event; Earp means the modified musical event carries an arpeggio sign.
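Collecting the embodiment's tokens, the score vocabulary of step S1 can be sketched as follows (a non-normative Python sketch; the grouping and ordering follow the part-of-speech listing of step S4 below, and the helper names `pos_sizes` and `id_of` are ours):

```python
VOCAB = {
    "PAD": ["PAD"], "BOM": ["BOM"], "EOM": ["EOM"], "VB": ["VB"],
    "track number": ["S1", "S2", "S3"],
    "clef": ["Cg", "Cf"],
    "key signature": ["K0"] + [f"K{i}" for i in range(1, 7)]
                            + [f"K_{i}" for i in range(1, 7)],
    "time numerator": [f"TN{i}" for i in range(1, 13)],
    "time denominator": ["TD2", "TD4", "TD8"],
    "scale step": list("abcdefg"),
    "accidental": ["As", "Af", "Ass", "Aff"],
    "octave down": ["Osub"], "octave up": ["Osup"],
    "basic duration": ["D1", "D2", "D4", "D8", "D16", "D32", "D64"],
    "augmentation dot": ["Dot"],
    "beam": ["Bl", "Br"],
    "rest": ["Rest", "RSpace"],
    "expression/technique": ["EslurL", "EslurR", "Etie", "Earp"],
}
# number of tokens per part of speech, used later to expand mask vectors
pos_sizes = [len(tokens) for tokens in VOCAB.values()]
# a flat token-id mapping for the autoregressive model (66 tokens in total)
id_of = {tok: i for i, tok in
         enumerate(t for tokens in VOCAB.values() for t in tokens)}
```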
S2, as shown in FIG. 4, collecting digital staff notation, and training an autoregressive model on the digital staff notation to obtain a music generation model.
The digital staff notation can be obtained by manual entry in score-editing software. A score editor edits and produces the digital staff notation of existing musical works in professional staff editing software (such as Finale, MuseScore, etc.) and then exports the score, i.e. the digital staff notation, in MusicXML format. MusicXML is a general description format that can be converted to and from the score text of the present application and from Lilypond text, and can therefore be used for format conversion.
Digital staff notation can also be obtained from staff images: specifically, a large number of existing scores can be photographed and the resulting staff images recognized with an OMR engine. Training the autoregressive model with all the digital staff notation as the training set yields the music generation model.
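The patent does not fix a particular autoregressive architecture. As a rough sketch under that assumption, a small recurrent language model over the token ids could be trained as follows (PyTorch assumed; random toy data stands in for a real corpus):

```python
import torch
import torch.nn as nn

class ScoreLM(nn.Module):
    def __init__(self, vocab_size: int, dim: int = 128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.rnn = nn.LSTM(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, x):                  # x: (batch, seq) of token ids
        h, _ = self.rnn(self.embed(x))
        return self.head(h)                # next-token logits

vocab_size = 66                            # token count of the vocabulary sketch above
model = ScoreLM(vocab_size)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# corpus: token-id sequences of digital staff notation (random toy data here)
corpus = [torch.randint(0, vocab_size, (32,)) for _ in range(8)]
for seq in corpus:
    x, y = seq[:-1].unsqueeze(0), seq[1:].unsqueeze(0)  # shifted next-token pairs
    loss = loss_fn(model(x).flatten(0, 1), y.flatten())
    opt.zero_grad(); loss.backward(); opt.step()
```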
OMR is a method of producing digital staff notation by recognizing score images. OMR implementations follow several technical routes, including: a. locating the semantic point set of music symbols in the image and analyzing their topological structure to obtain score data; b. converting the image directly into a language description of some kind of score with a cross-modal computer vision plus language (V+L) model. This prior art is not the focus of protection of the present application and is not described in detail in this embodiment.
S3, building a score piece based on the score vocabulary constructed in step S1 and the user's requirements, the score piece comprising a number of score measures, each score measure beginning with the measure start symbol and ending with the measure end symbol, with at least one voice part between the measure start symbol and the measure end symbol.
Preferably, when several voice parts lie between the measure start symbol and the measure end symbol, a voice-part separator is further arranged between each two adjacent voice parts.
Preferably, a token separator is arranged between any two adjacent symbols (measure start symbol, measure end symbol, voice-part separator, etc.); in this embodiment, a space is used as the token separator.
For example: "BOM (voice part) EOM" is a score measure, "BOM (voice part 1) VB (voice part 2) EOM" is also a score measure, and "BOM (voice part 1) VB (voice part 2) VB (voice part 3) EOM" is also a score measure.
Each voice part comprises a number of context items and a number of event items, an event item comprising a chord (chord) describing the pitch content, a duration (duration) describing how long the musical event lasts, and an event suffix (post_events) describing the modifiers of the musical event. When a voice part is produced, the duration is placed between the chord and the event suffix; the duration comprises a basic duration and λ augmentation dots, with λ ≥ 0.
The chord comprises a number of pitches, each pitch comprising a note name and a number of octave shifts, and each note name comprising a scale step and at most one accidental. A pitch may also carry no octave shift at all; the number of pitches is set according to the specific need. The pitch determines the absolute position of a single tone on the pitch axis of the melody; the note name fixes the position of the tone within the natural group to which it belongs, a natural group being a set of 7 consecutive pitch positions c, d, e, f, g, a, b defined by the lines and spaces of the staff. The accidental defines the raising/lowering semitone attribute at each scale-step position.
The event suffix (post_events) is optional; it describes the modifying components of a musical event, of which there may be none, or any one or more of the rest, the beam, and the expressions and techniques.
A context item describes the current state of the current voice part; it remains valid until the next context item of the same kind appears within the current voice part. Each context item is a single token and may be a track number, a key signature, a time-signature component (numerator or denominator) or a clef. Within a score measure, the time signature and the key signature normally appear only once, while the track number may appear several times.
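The voice-part structure just described can be pictured with a small, hypothetical data model (the class and field names are illustrative, not from the patent):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Pitch:
    step: str                              # scale step: one of a-g
    accidental: Optional[str] = None       # As / Af / Ass / Aff, at most one
    octave_shift: int = 0                  # +1 per Osup, -1 per Osub

@dataclass
class EventItem:
    chord: List[Pitch]                     # one or more pitches
    base_duration: str                     # D1 .. D64
    dots: int = 0                          # number of Dot tokens (lambda >= 0)
    suffixes: List[str] = field(default_factory=list)  # rests, beams, slurs...

# the event "g Osup b Af d D4 EslurR" from the example below:
event = EventItem(
    chord=[Pitch("g", octave_shift=1), Pitch("b", "Af"), Pitch("d")],
    base_duration="D4",
    suffixes=["EslurR"],
)
```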
The score measure in FIG. 6 is "BOM K_1 TN4 TD4 Cg S1 c D2 Dot EslurL g Osup b Af d D4 EslurR EOM", in which "K_1 TN4 TD4 Cg S1 c D2 Dot EslurL g Osup b Af d D4 EslurR" is a voice part; each of the tokens "K_1 TN4 TD4 Cg S1" is a context item; "c D2 Dot EslurL g Osup b Af d D4 EslurR" consists of event items, where "c" is a pitch, "D2 Dot" is a duration, "EslurL" is an event suffix, "g Osup" is a pitch, "b Af" is a pitch, "d" is a pitch, "D4" is a duration and "EslurR" is an event suffix. The staff notation corresponding to this score measure is shown in FIG. 7.
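A sketch of how the tokens of such a measure could be classified by their spelling (the regular expressions are ours, inferred from the embodiment's token list, not part of the patent):

```python
import re

SEPARATORS = {"PAD", "BOM", "EOM", "VB"}

PATTERNS = [
    (re.compile(r"^S\d+$"), "track number"),
    (re.compile(r"^C[gf]$"), "clef"),
    (re.compile(r"^K_?\d$"), "key signature"),
    (re.compile(r"^TN\d+$"), "time numerator"),
    (re.compile(r"^TD\d+$"), "time denominator"),
    (re.compile(r"^[a-g]$"), "scale step"),
    (re.compile(r"^A(s|f|ss|ff)$"), "accidental"),
    (re.compile(r"^Osup$"), "octave up"),
    (re.compile(r"^Osub$"), "octave down"),
    (re.compile(r"^D\d+$"), "basic duration"),
    (re.compile(r"^Dot$"), "augmentation dot"),
    (re.compile(r"^B[lr]$"), "beam"),
    (re.compile(r"^(Rest|RSpace)$"), "rest"),
    (re.compile(r"^(EslurL|EslurR|Etie|Earp)$"), "expression/technique"),
]

def classify(token: str) -> str:
    """Map a token to its part of speech."""
    if token in SEPARATORS:
        return "section separator"
    for pattern, pos in PATTERNS:
        if pattern.match(token):
            return pos
    raise ValueError(f"unknown token: {token}")

measure = "BOM K_1 TN4 TD4 Cg S1 c D2 Dot EslurL g Osup b Af d D4 EslurR EOM"
for tok in measure.split():
    print(tok, "->", classify(tok))
```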
S4, constructing a part-of-speech transition matrix for constraining the sampling rules between adjacent parts of speech, the parts of speech comprising the section separators and the music theory indicators.
In the prior art, an autoregressive model generates text through a random sampling process. To avoid grammatical errors during random sampling, a set of adjacency transition rules between parts of speech is defined to constrain the sampling process. The part-of-speech transition matrix is $P = (p_{i,j}) \in \{0,1\}^{(M+N)\times(M+N)}$, where $p_{i,j}$ denotes the element in row $i$ and column $j$ of the part-of-speech transition matrix. The parts of speech corresponding to the rows and columns of $P$ are, in order: 1 PAD; 2 BOM; 3 EOM; 4 VB; 5 track number; 6 clef; 7 key signature; 8 time-signature numerator; 9 time-signature denominator; 10 scale step; 11 accidental; 12 octave down (Osub); 13 octave up (Osup); 14 basic duration; 15 augmentation dot; 16 beam; 17 rest; 18 expressions and techniques.
The part-of-speech transition matrix is therefore an (M+N) × (M+N) matrix. An element value indicates whether a transition from the row's part of speech to the column's part of speech is allowed (1 means allowed, 0 means not allowed). For example: the value in row 8, column 9 is 1, so an expression of the form "TN3 TD4" is legal (meaning 3/4 time); the value in row 9, column 8 is 0, so an expression of the form "TD4 TN3" is illegal.
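As a sketch, the matrix can be held as an 18 × 18 array; only the two entries the text states explicitly are filled in here, since the full matrix is given as a figure in the patent and is not reproduced:

```python
import numpy as np

POS = ["PAD", "BOM", "EOM", "VB", "track number", "clef", "key signature",
       "time numerator", "time denominator", "scale step", "accidental",
       "octave down", "octave up", "basic duration", "augmentation dot",
       "beam", "rest", "expression/technique"]

P = np.zeros((len(POS), len(POS)), dtype=int)
P[POS.index("time numerator"), POS.index("time denominator")] = 1  # "TN3 TD4" is legal
# the reverse entry stays 0: "TD4 TN3" is illegal

def allowed(prev: str, nxt: str) -> bool:
    """Whether a token of class `nxt` may follow a token of class `prev`."""
    return bool(P[POS.index(prev), POS.index(nxt)])

assert allowed("time numerator", "time denominator")
assert not allowed("time denominator", "time numerator")
```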
S5, inputting the score piece built in step S3 into the music generation model of step S2, and updating the random sampling process of the music generation model based on the part-of-speech transition matrix obtained in step S4, to obtain the score corresponding to the user's requirements.
The autoregressive model of the prior art is an unconditional generation process; the diversity of the generated results comes from the random sampling introduced during generation, and the output of the random sampling can be controlled through the seed of the pseudo-random number generator. Here "unconditional" means that the model takes no additional control parameters (such as prompts, styles, etc.). The present application further constrains the random sampling process based on the part-of-speech transition matrix. Specifically, after the score piece is input into the music generation model, the model first computes a prediction probability vector $p$; the row vector of the part-of-speech transition matrix corresponding to the current part of speech is then selected as the first mask vector $m$ of the predicted part of speech; the first mask vector is expanded according to the number of tokens belonging to each part of speech to obtain the second mask vector $\bar{m}$; the second mask vector $\bar{m}$ is multiplied element-wise with the prediction probability vector $p$ to obtain the updated prediction probability vector $\tilde{p}$; the normalized prediction probability vector $\hat{p}$ is then obtained through a softmax function, and random sampling is performed on $\hat{p}$. For example, when the current part of speech is BOM, the second row of the matrix is taken as the first mask vector and expanded token by token into the second mask vector.
The updated prediction probability vector $\tilde{p}$ is computed as:

$\tilde{p}_k = m(p_k) = p_k \cdot \bar{m}_k$

where $\tilde{p}_k$ denotes the $k$-th element of the updated prediction probability vector $\tilde{p}$, $p_k$ denotes the $k$-th element of the prediction probability vector $p$, $\bar{m}_k$ denotes the $k$-th element of the second mask vector, and $m$ is the masking operation.

The normalized prediction probability vector $\hat{p}$ is computed as:

$\hat{p}_k = \dfrac{\exp(\tilde{p}_k / T)}{\sum_{j} \exp(\tilde{p}_j / T)}$

where $T$ denotes the sampling temperature, whose value determines the degree of randomness of the sampling.
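Putting the pieces together, the constrained sampling step of S5 might look like the following sketch (NumPy; `pos_sizes` as in the vocabulary sketch above, `P` as in the matrix sketch; the function name is ours):

```python
import numpy as np

def sample_next(prob, prev_pos, P, pos_sizes, T=1.0, rng=None):
    """prob: the model's prediction probability vector over all tokens;
    prev_pos: row index in P of the current token's part of speech."""
    rng = rng or np.random.default_rng()
    m1 = P[prev_pos]                  # first mask vector (one entry per part of speech)
    m2 = np.repeat(m1, pos_sizes)     # second mask vector (one entry per token)
    p_tilde = prob * m2               # element-wise masking of the probabilities
    z = np.exp(p_tilde / T)           # temperature softmax, as in the formula above
    p_hat = z / z.sum()
    return rng.choice(len(p_hat), p=p_hat)
```

Note that, read literally, the softmax leaves a small nonzero probability on masked tokens (since exp(0/T) = 1); an implementation may additionally zero those entries after the softmax and renormalize so that the transition rules are enforced strictly.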
For example: when the score measure is "BOM K0 TN4 TD4 S1 Cg c D1 EOM", the corresponding score is shown in FIG. 2. When the score measure is "BOM K4 TN3 TD8 S1 Cg f As Osup D32 Bl d As D32 b D32 g As D32 Br S2 f As Osub D32 Bl d As D32 b D32 g As D32 Br S1 d As Osup D32 Bl S2 g As D32 Osub S1 b D32 d As D32 Br VB S2 Cf b D8 S1 d As Osup D8 EslurL f As D8 EslurR VB S2 Cf b Osub d As D8 b D8 b D8 EOM", the corresponding score is shown in FIG. 3.
As shown in FIG. 5, a musical composition generated from the score vocabulary defined by the present application can be parsed with a BNF grammar; for a language defined by a BNF grammar, a parser generator (e.g., Bison, Yacc, etc.) can produce a corresponding generic music data format, which can then be played and displayed with other digital staff tools.
The embodiment of the application also provides a system for generating AI musical compositions, comprising:
Vocabulary construction module: for constructing a score vocabulary comprising M section separators and N music theory indicators based on music theory knowledge, the section separators and the music theory indicators each being represented by distinct characters;
Model generation module: for collecting digital staff notation and training an autoregressive model on the digital staff notation to obtain a music generation model;
Score piece establishment module: for building a score piece based on the score vocabulary of the vocabulary construction module and the user's requirements, the score piece comprising a number of score measures, each score measure beginning with the measure start symbol among the section separators and ending with the measure end symbol among the section separators, with at least one voice part between the measure start symbol and the measure end symbol;
Matrix storage module: for storing a part-of-speech transition matrix, the part-of-speech transition matrix constraining the sampling rules between two adjacent parts of speech, the parts of speech comprising the section separators and the music theory indicators;
Score generation module: for updating the random sampling process of the music generation model based on the part-of-speech transition matrix of the matrix storage module, and obtaining the score corresponding to the user's requirements from the score piece of the score piece establishment module and the updated music generation model.
The embodiment of the application also provides an electronic device comprising a processor and a memory, the memory storing a computer program which, when executed by the processor, performs the steps of the above method for generating AI musical compositions.
The embodiment of the application also provides a computer-readable storage medium storing a computer program which, when executed, performs the steps of the above method for generating AI musical compositions. In particular, the storage medium may be a general-purpose storage medium, such as a removable disk or a hard disk; when the computer program on the storage medium is run, the above embodiments of the method for generating AI musical compositions can be carried out.
The foregoing is only a preferred embodiment of the invention and is not intended to limit the invention; any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall be included within the protection scope of the invention.
Claims (7)
1. A method for generating an AI musical composition, comprising the steps of:
S1, constructing a score vocabulary comprising M section separators and N music theory indicators based on music theory knowledge, the section separators and the music theory indicators each being represented by distinct characters;
S2, collecting digital staff notation, and training an autoregressive model on the digital staff notation to obtain a music generation model;
S3, building a score piece based on the score vocabulary constructed in step S1 and the user's requirements, the score piece comprising a number of score measures, each score measure beginning with the measure start symbol among the section separators and ending with the measure end symbol among the section separators, with at least one voice part between the measure start symbol and the measure end symbol;
S4, constructing a part-of-speech transition matrix for constraining the sampling rules between two adjacent parts of speech, the parts of speech comprising the section separators and the music theory indicators;
S5, inputting the score piece into the music generation model while updating the random sampling process of the music generation model based on the part-of-speech transition matrix, to obtain the score corresponding to the user's requirements.
2. The method for generating an AI musical composition according to claim 1, wherein in step S3, when ζ > 1, a voice-part separator is further arranged between two adjacent voice parts, ζ denoting the number of voice parts between the measure start symbol and the measure end symbol of the same score measure.
3. The method for generating an AI musical composition according to claim 1, wherein in step S3, token separators are arranged between the measure start symbol and the voice part, and between the measure end symbol and the voice part, of the same score measure.
4. The method for generating an AI musical composition according to claim 1, wherein in step S3, each voice part comprises a number of context items and a number of event items, each event item comprising a chord describing the pitch content, a duration describing how long the musical event lasts, and an event suffix describing the modifiers of the musical event;
the context items comprise a time signature, a clef, a track number and a key signature, each of which is one of the music theory indicators.
5. The method for generating an AI musical composition according to claim 4, wherein the chord comprises note names and octave shifts, each note name comprising a scale step and at most one accidental; the octave shift (up or down), the scale step and the accidental are each one of the music theory indicators.
6. The method for generating an AI musical composition according to claim 1, wherein the section separators comprise PAD as a filler, BOM denoting the measure start symbol, EOM denoting the measure end symbol, and VB denoting the voice-part separator;
the music theory indicators comprise the track number, the clef, the key signature, the time-signature numerator, the time-signature denominator, the scale step, the accidental, the octave up, the octave down, the basic duration, the augmentation dot, the beam, the rest, and the expressions and techniques;
the part-of-speech transition matrix is $P = (p_{i,j}) \in \{0,1\}^{(M+N)\times(M+N)}$, where $p_{i,j}$ denotes the element in row $i$ and column $j$ of the part-of-speech transition matrix $P$; $p_{i,j} = 1$ indicates that a transition from the part of speech of the row to the part of speech of the column is allowed, and $p_{i,j} = 0$ indicates that such a transition is not allowed; the parts of speech corresponding to the rows and columns of $P$ are, in order: 1 PAD; 2 BOM; 3 EOM; 4 VB; 5 track number; 6 clef; 7 key signature; 8 time-signature numerator; 9 time-signature denominator; 10 scale step; 11 accidental; 12 octave down; 13 octave up; 14 basic duration; 15 augmentation dot; 16 beam; 17 rest; 18 expressions and techniques.
7. A system for generating AI musical compositions, comprising:
Vocabulary construction module: for constructing a score vocabulary comprising M section separators and N music theory indicators based on music theory knowledge, the section separators and the music theory indicators each being represented by distinct characters;
Model generation module: for collecting digital staff notation and training an autoregressive model on the digital staff notation to obtain a music generation model;
Score piece establishment module: for building a score piece based on the score vocabulary of the vocabulary construction module and the user's requirements, the score piece comprising a number of score measures, each score measure beginning with the measure start symbol among the section separators and ending with the measure end symbol among the section separators, with at least one voice part between the measure start symbol and the measure end symbol;
Matrix storage module: for storing a part-of-speech transition matrix, the part-of-speech transition matrix constraining the sampling rules between two adjacent parts of speech, the parts of speech comprising the section separators and the music theory indicators;
Score generation module: for updating the random sampling process of the music generation model based on the part-of-speech transition matrix of the matrix storage module, and obtaining the score corresponding to the user's requirements from the score piece of the score piece establishment module and the updated music generation model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410857320.2A CN118609529A (en) | 2024-06-28 | 2024-06-28 | Method and system for generating AI music works |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118609529A (en) | 2024-09-06
Family
ID=92549897
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410857320.2A (Pending) | Method and system for generating AI music works | 2024-06-28 | 2024-06-28
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118609529A (en) |
- 2024-06-28: application CN202410857320.2A filed in CN; patent CN118609529A (en), status Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |