CN115712729A - Interactive music generation method and device based on compilation - Google Patents

Interactive music generation method and device based on compilation Download PDF

Info

Publication number
CN115712729A
Authority
CN
China
Prior art keywords
text
music
configuration
compiled
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211465790.1A
Other languages
Chinese (zh)
Inventor
高成英
黄靖雯
许遵楠
唐一琳
梁潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN202211465790.1A priority Critical patent/CN115712729A/en
Publication of CN115712729A publication Critical patent/CN115712729A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses an interactive compilation-based music generation method and device, wherein the method comprises the steps of obtaining a music description language text and a music style description text; carrying out music configuration classification processing on the music style description text through a pre-trained configuration classification model to obtain a configuration text; carrying out recomposition processing on the music description language text according to the configuration text to obtain a text to be compiled; and performing music compiling generation processing on the text to be compiled to obtain a compilable music file. The embodiment of the invention enables users to create music without learning music theory, and can be widely applied in the technical field of artificial intelligence.

Description

Interactive music generation method and device based on compilation
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an interactive music generation method and device based on compilation.
Background
Currently, most music creation applications are based on simulating real-world instruments, such as piano simulators, guitar simulators, and the like. Such applications typically require users to know music theory and to have some experience with the instrument, so they mainly serve musicians who already play an instrument well. For general music enthusiasts who lack music theory knowledge and instrument experience, such simulation-based music creation applications often fail to help them produce a satisfactory work. In the related art, music is mainly generated through manual creation; for example, hip-hop music may be written by a professional hip-hop singer. People without any musical background, however, have no means of composing music at all. In view of the above, the technical problems in the related art need to be solved.
Disclosure of Invention
In view of this, embodiments of the present invention provide an interactive music generation method and apparatus based on compilation, so as to implement music creation without mastering music theory knowledge.
In one aspect, the present invention provides an interactive compilation-based music generation method, comprising:
acquiring a music description language text and a music style description text;
carrying out music configuration classification processing on the music style description text through a pre-trained configuration classification model to obtain a configuration text;
carrying out recomposition processing on the music description language text according to the configuration text to obtain a text to be compiled;
and performing music compiling generation processing on the text to be compiled to obtain a compilable music file.
Optionally, the obtaining a configuration text by performing music configuration classification processing on the music style description text through a pre-trained configuration classification model includes:
the configuration classification model comprises a text encoder, a music generator and an audio discriminator;
preprocessing the music style description text through the text encoder to obtain a text hidden vector;
performing music generation processing on the text hidden vector through the music generator to obtain music fragments;
and carrying out audio discrimination processing on the music fragments through the audio discriminator to obtain a configuration text.
Optionally, before the music configuration classification processing is performed on the music style description text through the pre-trained configuration classification model to obtain a configuration text, training the configuration classification model includes:
acquiring a training text;
inputting the training text into an initialized configuration classification model to obtain a classification result;
and freezing the weight of the music generator in the configuration classification model, and optimizing a text encoder in the configuration classification model according to the classification result to obtain the trained configuration classification model.
Optionally, the adapting the music description language text according to the configuration text to obtain a text to be compiled includes:
carrying out configuration analysis processing on the configuration text to obtain a text configuration item;
and updating the music description language text according to the text configuration item to obtain a text to be compiled.
Optionally, the music compiling and generating processing on the text to be compiled to obtain a compilable music file includes:
performing lexical analysis processing on the text to be compiled to obtain a symbol sequence set;
and carrying out syntactic analysis processing on the symbol sequence set, and executing semantic actions corresponding to the syntactic descriptions obtained by analysis to obtain the compilable music text.
Optionally, the lexical analysis processing on the text to be compiled to obtain a symbol sequence set includes:
and performing lexical analysis processing on the text to be compiled, and identifying the symbols of the text to be compiled through a lexical analyzer to obtain a symbol sequence set, wherein the symbol sequence set at least comprises punctuation marks, reserved words, note names, integers and identifiers.
Optionally, the parsing the symbol sequence set and executing a semantic action corresponding to the parsed syntax description to obtain a compilable music text includes:
performing configuration analysis on the header of the symbol sequence set through a syntax analyzer to obtain a global configuration item;
initializing a Musical Instrument Digital Interface (MIDI) audio track according to the global configuration item;
reducing the remaining part of the symbol sequence set through the syntax analyzer to obtain a music measure;
calculating the timing of the note events of the music measure to obtain a MIDI event;
adding the MIDI event to the MIDI audio track to obtain a compilable music text.
On the other hand, an embodiment of the present invention further provides an interactive compiling-based music generating apparatus, including:
the first module is used for acquiring a music description language text and a music style description text;
the second module is used for carrying out music configuration classification processing on the music style description text through a pre-trained configuration classification model to obtain a configuration text;
the third module is used for carrying out recomposition processing on the music description language text according to the configuration text to obtain a text to be compiled;
and the fourth module is used for performing music compiling generation processing on the text to be compiled to obtain a compilable music file.
On the other hand, the embodiment of the invention also discloses an electronic device, which comprises a processor and a memory;
the memory is used for storing programs;
the processor executes the program to implement the method as described above.
On the other hand, the embodiment of the invention also discloses a computer readable storage medium, wherein the storage medium stores a program, and the program is executed by a processor to realize the method.
In another aspect, an embodiment of the present invention further discloses a computer program product or a computer program, where the computer program product or the computer program includes computer instructions, and the computer instructions are stored in a computer-readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and the computer instructions executed by the processor cause the computer device to perform the foregoing method.
Compared with the prior art, the technical scheme adopted by the invention has the following technical effects: the embodiment of the invention obtains a music description language text and a music style description text; carries out music configuration classification processing on the music style description text through a pre-trained configuration classification model to obtain a configuration text; carries out recomposition processing on the music description language text according to the configuration text to obtain a text to be compiled; and performs music compiling generation processing on the text to be compiled to obtain a compilable music file. The embodiment of the invention thus enables users to compose music without learning music theory.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. It is apparent that the drawings in the following description are only some embodiments of the present application, and that other drawings can be obtained by those skilled in the art based on these drawings without creative effort.
FIG. 1 is a flow chart of an interactive compilation-based music generation method provided by an embodiment of the present application;
FIG. 2 is a diagram illustrating syntax definition of a music description language according to an embodiment of the present application;
FIG. 3 is a diagram of an example of a compilable music description language text provided by an embodiment of the present application;
fig. 4 is a neural network model framework diagram of a configuration classification model provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
First, several terms referred to in the present application are explained:
Musical Instrument Digital Interface (MIDI): the most widely used standard music format in the composition world, sometimes called "the musical score a computer can understand". It records music as digital control signals for the notes. A complete piece of MIDI music is only a few tens of KB in size and can contain dozens of tracks. Almost all modern music is synthesized with MIDI plus a timbre library.
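For illustration only (not part of the claimed invention), the following Python sketch shows how MIDI note-on/note-off events and a tempo setting can be written to a small .mid file; it assumes the third-party mido library, which is not mentioned in this application:

from mido import Message, MetaMessage, MidiFile, MidiTrack, bpm2tempo

mid = MidiFile(ticks_per_beat=480)
track = MidiTrack()
mid.tracks.append(track)
track.append(MetaMessage('set_tempo', tempo=bpm2tempo(120), time=0))   # 120 BPM
track.append(Message('program_change', program=0, time=0))             # piano timbre
for note in (60, 62, 64, 65):                                          # C4 D4 E4 F4, one beat each
    track.append(Message('note_on', note=note, velocity=64, time=0))
    track.append(Message('note_off', note=note, velocity=64, time=480))
mid.save('example.mid')                                                # a few hundred bytes on disk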
Long Short-Term Memory recurrent neural network (LSTM): a special type of recurrent neural network. Unlike an ordinary feedforward neural network, an LSTM can analyze its input as a time series.
Extended Backus-Naur Form (EBNF): a meta-syntax notation for expressing context-free grammars, used as a formal way to describe computer programming languages and other formal languages. It is an extension of the basic Backus-Naur Form (BNF) meta-syntax notation.
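For illustration only, a toy grammar in EBNF-like notation can be expressed and tested with the third-party Python library lark; the rule names and token patterns below are invented for this sketch, and the actual grammar of the music description language is the one shown in FIG. 2:

from lark import Lark

grammar = r"""
    start: header measure+
    header: "BPM" ":" INT
    measure: NOTE+ "|"
    NOTE: /[A-G](#|b)?\d/
    %import common.INT
    %import common.WS
    %ignore WS
"""
parser = Lark(grammar)
print(parser.parse("BPM: 96  C4 E4 G4 | G4 E4 C4 |").pretty())

The printed parse tree contains one header node and two measure nodes, which mirrors the reduction steps described later for the syntax analyzer.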
In order to provide a more convenient tool for music creation and to bring artistic creation and appreciation into the daily life of the public, computer-aided composition techniques have been developed in recent years. They can effectively help the public create a variety of musical works, enrich people's cultural life, and have found many creative applications in the entertainment and multimedia fields. However, there is still a lack of technology that can generate music directly from natural language descriptions, even though users who lack musical knowledge would prefer such a convenient way to control the music creation process and generate the target music they desire.
In view of the above, referring to fig. 1, an embodiment of the present invention provides an interactive compilation-based music generation method, including:
S101, acquiring a music description language text and a music style description text;
S102, carrying out music configuration classification processing on the music style description text through a pre-trained configuration classification model to obtain a configuration text;
S103, carrying out recomposition processing on the music description language text according to the configuration text to obtain a text to be compiled;
and S104, performing music compiling generation processing on the text to be compiled to obtain a compilable music file.
In the embodiment of the invention, for a user who lacks music theory and programming knowledge, an existing music description language text is selected and obtained; the user then provides a music style description text that expresses the desired musical style in natural language. The music style description text is analyzed by the configuration classification model to determine the corresponding audio category and a set of configuration values, and finally the configuration part of the selected music description language text is rewritten to obtain a new compilable music text. Referring to fig. 2, fig. 2 illustrates the EBNF syntax definition of a music description language that can be compiled by the compiler according to an embodiment of the present invention. Referring to fig. 3, fig. 3 shows a music description language text written according to this syntax, from which the compiler can compile and generate a playable MIDI file. In the embodiment of the present invention, it is also conceivable that the user writes a music description language text according to the syntax by himself and performs music compiling generation processing on it to obtain a compilable music file.
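For illustration only, the four steps S101-S104 can be sketched as a single Python function; every name below is a placeholder invented for this sketch rather than an interface defined by this application:

from typing import Callable

def generate_music(
    description_text: str,                        # S101: music description language text
    style_text: str,                              # S101: natural-language style description
    classify_configuration: Callable[[str], str],
    rewrite_configuration: Callable[[str, str], str],
    compile_to_midi: Callable[[str], bytes],
) -> bytes:
    config_text = classify_configuration(style_text)                        # S102
    text_to_compile = rewrite_configuration(description_text, config_text)  # S103
    return compile_to_midi(text_to_compile)                                 # S104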
Further as a preferred embodiment, the performing music configuration classification processing on the music style description text by using a pre-trained configuration classification model to obtain a configuration text includes:
the configuration classification model comprises a text encoder, a music generator and an audio discriminator;
preprocessing the music style description text through the text encoder to obtain a text hidden vector;
performing music generation processing on the text hidden vector through the music generator to obtain music fragments;
and carrying out audio discrimination processing on the music fragments through the audio discriminator to obtain a configuration text.
In the embodiment of the invention, the configuration classification model comprises a text encoder, a music generator and an audio discriminator. The text encoder can be obtained by training a bidirectional LSTM model; it preprocesses the music style description text to generate a text hidden vector. The text hidden vector is input into a pre-trained music generation model based on a linear Transformer to obtain a music fragment. The music fragment is then analyzed by the audio discriminator, and the configuration corresponding to the resulting audio classification is looked up in a predefined configuration table to obtain the configuration text. As shown in fig. 4, fig. 4 is a neural network model framework diagram of the configuration classification model provided in an embodiment of the present application.
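For illustration only, the three components of FIG. 4 can be sketched in PyTorch as follows; the layer sizes are invented, and a standard nn.TransformerEncoder stands in for the linear-attention Transformer generator described above:

import torch
import torch.nn as nn

class TextEncoder(nn.Module):                         # bidirectional LSTM
    def __init__(self, vocab_size=10000, emb=256, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.lstm = nn.LSTM(emb, hidden, batch_first=True, bidirectional=True)

    def forward(self, token_ids):                     # (B, T) -> (B, 2*hidden)
        out, _ = self.lstm(self.embed(token_ids))
        return out.mean(dim=1)                        # pooled text hidden vector

class MusicGenerator(nn.Module):                      # stand-in for the linear Transformer
    def __init__(self, dim=512, steps=64):
        super().__init__()
        self.steps = steps
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.body = nn.TransformerEncoder(layer, num_layers=4)
        self.to_tokens = nn.Linear(dim, 128)          # 128 MIDI pitches per time step

    def forward(self, text_vec):                      # (B, dim) -> (B, steps, 128)
        seq = text_vec.unsqueeze(1).repeat(1, self.steps, 1)
        return self.to_tokens(self.body(seq))

class AudioDiscriminator(nn.Module):                  # maps a music fragment to a config class
    def __init__(self, n_configs=16):
        super().__init__()
        self.head = nn.Sequential(nn.Flatten(), nn.LazyLinear(n_configs))

    def forward(self, fragment):                      # (B, steps, 128) -> (B, n_configs)
        return self.head(fragment)

The predicted class index would then be used to look up the corresponding entry of the predefined configuration table.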
As a further preferred embodiment, before the music configuration classification processing is performed on the music style description text through the pre-trained configuration classification model to obtain a configuration text, training the configuration classification model includes:
acquiring a training text;
inputting the training text into an initialized configuration classification model to obtain a classification result;
and freezing the weight of the music generator in the configuration classification model, and optimizing a text encoder in the configuration classification model according to the classification result to obtain the trained configuration classification model.
In the embodiment of the invention, music style description texts for training are obtained, and each training text is input into a configuration classification model with initialized network parameters to obtain a configuration classification result. The weights of the pre-trained linear-Transformer-based music generator in the configuration classification model are then frozen, only the bidirectional-LSTM-based text encoder is fine-tuned, and an Adam optimizer is used to optimize the text encoder so that the model output matches the preset configuration as closely as possible. The learning rate of the Adam optimizer is set to 0.0002, the number of training iterations is 10000, and the training process of the model is completed on 8 Tesla V100 GPUs.
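For illustration only, and reusing the placeholder modules from the previous sketch, the fine-tuning loop with a frozen generator could look as follows; the loss pairing and data handling are assumptions of this sketch:

import torch
import torch.nn as nn

encoder, generator, discriminator = TextEncoder(), MusicGenerator(), AudioDiscriminator()

for p in generator.parameters():                      # freeze the pre-trained music generator
    p.requires_grad = False

optimizer = torch.optim.Adam(encoder.parameters(), lr=0.0002)
criterion = nn.CrossEntropyLoss()

def train_step(token_ids, target_config):             # target_config: (B,) class indices
    optimizer.zero_grad()
    fragment = generator(encoder(token_ids))           # text -> hidden vector -> music fragment
    logits = discriminator(fragment)                   # fragment -> configuration class scores
    loss = criterion(logits, target_config)
    loss.backward()
    optimizer.step()                                    # only the text encoder is updated
    return loss.item()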
Further as a preferred embodiment, the adapting the music description language text according to the configuration text to obtain a text to be compiled includes:
carrying out configuration analysis processing on the configuration text to obtain a text configuration item;
and updating the music description language text according to the text configuration item to obtain a text to be compiled.
In the embodiment of the invention, configuration analysis processing is performed on the configuration text to obtain text configuration items such as the tempo (BPM), instrument, volume, current scale, playing intensity and rhythm of the music, and the corresponding configuration of the music description language text is changed according to the text configuration items to obtain the text to be compiled.
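For illustration only, configuration parsing and rewriting can be sketched as follows; the "key: value" line format is an assumption made for this sketch, not the format defined by the grammar of FIG. 2:

import re

def parse_config(config_text: str) -> dict:
    items = {}
    for line in config_text.splitlines():
        m = re.match(r"\s*(\w+)\s*:\s*(.+?)\s*$", line)
        if m:
            items[m.group(1).upper()] = m.group(2)
    return items

def rewrite_configuration(description_text: str, config_text: str) -> str:
    items = parse_config(config_text)                 # e.g. {'BPM': '96', 'INSTRUMENT': '0'}
    out = []
    for line in description_text.splitlines():
        m = re.match(r"\s*(\w+)\s*:", line)
        if m and m.group(1).upper() in items:         # replace a matching configuration line
            out.append(f"{m.group(1)}: {items[m.group(1).upper()]}")
        else:
            out.append(line)                          # keep all other lines unchanged
    return "\n".join(out)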
Further as a preferred embodiment, the performing music compiling and generating processing on the text to be compiled to obtain a compilable music file includes:
performing lexical analysis processing on the text to be compiled to obtain a symbol sequence set;
and carrying out syntactic analysis processing on the symbol sequence set, and executing semantic actions corresponding to the syntactic descriptions obtained by analysis to obtain the compilable music text.
In the embodiment of the invention, lexical analysis is performed on the text to be compiled by a lexical analyzer to obtain a symbol sequence set, which then serves as the input of the bottom-up syntax analysis of the text. The embodiment of the invention provides a syntax analyzer that reduces the obtained symbol sequence step by step from terminal symbols to the start symbol according to the grammar, executes the semantic actions corresponding to the grammar descriptions during the analysis, and finally generates a MIDI file when the start symbol is reduced.
Further as a preferred embodiment, the lexical analysis processing on the text to be compiled to obtain a symbol sequence set includes:
and performing lexical analysis processing on the text to be compiled, and identifying the symbols of the text to be compiled through a lexical analyzer to obtain a symbol sequence set, wherein the symbol sequence set at least comprises punctuation marks, reserved words, note names, integers and identifiers.
In the embodiment of the invention, the lexical analyzer performs lexical analysis on the text to be compiled and identifies its symbols to obtain the symbol sequence set. Specifically, whitespace and end-of-line comments in the text to be compiled are recognized by the lexical analyzer, but no symbol is returned to the syntax analyzer; punctuation marks, reserved words and note names are recognized by the lexical analyzer, and the corresponding symbols are returned to the syntax analyzer; integers are recognized by the lexical analyzer, and if a recognized integer has more than 12 digits an error is reported and the program terminates, otherwise the corresponding symbol is returned to the syntax analyzer; identifiers are recognized by the lexical analyzer, and if a recognized identifier is longer than 24 characters an error is reported and the program terminates, otherwise the corresponding symbol is returned to the syntax analyzer.
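For illustration only, these lexical rules can be sketched with regular expressions in Python; the token patterns, the reserved-word list and the "//" comment marker are assumptions of this sketch, while the 12-digit and 24-character limits follow the description above:

import re

RESERVED = {"bpm", "instrument", "velocity", "track", "measure"}         # hypothetical reserved words
NOTE = re.compile(r"[A-G](#|b)?\d")                                      # e.g. C4, F#3, Bb2
TOKEN = re.compile(r"\s+|//[^\n]*|[A-G](#|b)?\d|\d+|[A-Za-z_]\w*|[{}()\[\]:;,|]")

def tokenize(text):
    symbols, pos = [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character {text[pos]!r} at position {pos}")
        lexeme, pos = m.group(), m.end()
        if lexeme.isspace() or lexeme.startswith("//"):
            continue                                   # whitespace and comments return no symbol
        if NOTE.fullmatch(lexeme):
            symbols.append(("NOTE", lexeme))
        elif lexeme.isdigit():
            if len(lexeme) > 12:
                raise SyntaxError("integer longer than 12 digits")
            symbols.append(("INT", int(lexeme)))
        elif lexeme[0].isalpha() or lexeme[0] == "_":
            if len(lexeme) > 24:
                raise SyntaxError("identifier longer than 24 characters")
            kind = "RESERVED" if lexeme.lower() in RESERVED else "IDENT"
            symbols.append((kind, lexeme))
        else:
            symbols.append(("PUNCT", lexeme))
    return symbols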
As a further preferred embodiment, the parsing the symbol sequence set and executing the semantic action corresponding to the parsed syntax description to obtain the compilable music text includes:
performing configuration analysis on the header of the symbol sequence set through a syntax analyzer to obtain a global configuration item;
initializing a Musical Instrument Digital Interface (MIDI) audio track according to the global configuration item;
reducing the remaining part of the symbol sequence set through the syntax analyzer to obtain a music measure;
calculating the timing of the note events of the music measure to obtain a MIDI event;
adding the MIDI event to the MIDI audio track to obtain a compilable music text.
In an embodiment of the present invention, the syntax analyzer parses the symbol stream, i.e. the symbol sequence set output by the lexical analyzer, from front to back. The global configuration items at the head of the symbol stream are parsed first; according to the grammar definition, they mainly comprise the tempo (BPM) of the piece, the instrument used (INSTRUMENT) and the volume (VELOCITY). After the values of the global configuration are parsed, a MIDI track is initialized in the analyzer (i.e. an empty track without MIDI events is created, configured only with the overall tempo, instrument timbre and volume obtained from the parsing results). The analyzer then continues to parse the symbol stream, progressively reducing symbols bottom-up to note events and then reducing note events to measures according to the grammar. After each measure is reduced, the syntax analyzer calculates the duration of every note event in the measure, generates note-on and note-off events, and adds these MIDI events to the MIDI track. After all measures have been reduced, the syntax analyzer reduces them to the start symbol, which marks the end of the parsing process; the semantic routine is then called to write the MIDI track temporarily stored in the analyzer into a MIDI file, ending the compilation and generation process.
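For illustration only, the semantic actions can be sketched in a greatly simplified form with the third-party mido library; the toy text format below (a "KEY:value" header followed by lines of note names, one beat per note) is an assumption of this sketch, whereas the embodiment drives these actions from the bottom-up parser and the grammar of FIG. 2:

from mido import Message, MetaMessage, MidiFile, MidiTrack, bpm2tempo

NOTE_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}

def pitch(name: str) -> int:                           # "C4" -> 60
    return 12 * (int(name[-1]) + 1) + NOTE_OFFSETS[name[0]] + ("#" in name) - ("b" in name[1:])

def compile_to_midi(text: str, path: str = "out.mid", ticks_per_beat: int = 480):
    lines = [l.strip() for l in text.splitlines() if l.strip()]
    config = dict(l.split(":") for l in lines if ":" in l)               # global configuration header
    measures = [l.split() for l in lines if ":" not in l]                # remaining lines are measures

    mid = MidiFile(ticks_per_beat=ticks_per_beat)
    track = MidiTrack()                                                  # initialize an empty MIDI track
    mid.tracks.append(track)
    track.append(MetaMessage("set_tempo", tempo=bpm2tempo(int(config.get("BPM", 120))), time=0))
    track.append(Message("program_change", program=int(config.get("INSTRUMENT", 0)), time=0))

    velocity = int(config.get("VELOCITY", 64))
    for measure in measures:                                             # note events, measure by measure
        for name in measure:
            track.append(Message("note_on", note=pitch(name), velocity=velocity, time=0))
            track.append(Message("note_off", note=pitch(name), velocity=velocity, time=ticks_per_beat))
    mid.save(path)                                                       # write the track to a MIDI file

compile_to_midi("BPM:96\nINSTRUMENT:0\nVELOCITY:80\nC4 E4 G4 C5\nG4 E4 C4 C4")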
On the other hand, an embodiment of the present invention further provides an interactive compiling-based music generating apparatus, including:
the first module is used for acquiring a music description language text and a music style description text;
the second module is used for carrying out music configuration classification processing on the music style description text through a pre-trained configuration classification model to obtain a configuration text;
the third module is used for carrying out recomposition processing on the music description language text according to the configuration text to obtain a text to be compiled;
and the fourth module is used for performing music compiling generation processing on the text to be compiled to obtain a compilable music file.
Corresponding to the method of fig. 1, an embodiment of the present invention further provides an electronic device, including a processor and a memory; the memory is used for storing programs; the processor executes the program to implement the method as described above.
Corresponding to the method of fig. 1, the embodiment of the present invention also provides a computer-readable storage medium, which stores a program, and the program is executed by a processor to implement the method as described above.
The embodiment of the invention also discloses a computer program product or a computer program, which comprises computer instructions, and the computer instructions are stored in a computer readable storage medium. The computer instructions may be read by a processor of a computer device from a computer-readable storage medium, and executed by the processor to cause the computer device to perform the method illustrated in fig. 1.
In summary, embodiments of the present invention provide an interactive compilation-based music generation method that enables general music enthusiasts to compose MIDI music in a programmatic form after learning a simple grammar. Users who have no music theory knowledge and cannot use the grammar correctly can, through natural language descriptions, directly guide the compiler to generate music of a custom style on the basis of an existing music description language text. In contrast to other music creation applications based on simulated instruments, the method requires neither instrument experience nor knowledge of music theory from the user. It turns the music creation process entirely into a programming process and describes music in a form that combines a musical score with a computer program; this high level of abstraction gives it potential for further application development.
In alternative embodiments, the functions/acts noted in the block diagrams may occur out of the order noted in the operational illustrations. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality/acts involved. Furthermore, the embodiments presented and described in the flow charts of the present invention are provided by way of example in order to provide a more comprehensive understanding of the technology. The disclosed methods are not limited to the operations and logic flows presented herein. Alternative embodiments are contemplated in which the order of various operations is changed and in which sub-operations described as part of larger operations are performed independently.
Furthermore, although the present invention is described in the context of functional modules, it should be understood that, unless otherwise indicated to the contrary, one or more of the described functions and/or features may be integrated in a single physical device and/or software module, or one or more functions and/or features may be implemented in separate physical devices or software modules. It will also be appreciated that a detailed discussion of the actual implementation of each module is not necessary for an understanding of the present invention. Rather, the actual implementation of the various functional modules in the apparatus disclosed herein will be understood within the ordinary skill of an engineer given the nature, function, and interrelationships of the modules. Accordingly, those skilled in the art can, using ordinary skill, practice the invention as set forth in the claims without undue experimentation. It is also to be understood that the specific concepts disclosed are merely illustrative of and not intended to limit the scope of the invention, which is defined by the appended claims and their full scope of equivalents.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CDROM). Additionally, the computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via for instance optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner if necessary, and then stored in a computer memory.
It should be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
While embodiments of the invention have been shown and described, it will be understood by those of ordinary skill in the art that: various changes, modifications, substitutions and alterations can be made to the embodiments without departing from the principles and spirit of the invention, the scope of which is defined by the claims and their equivalents.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (10)

1. An interactive compilation-based music generation method, the method comprising:
acquiring a music description language text and a music style description text;
carrying out music configuration classification processing on the music style description text through a pre-trained configuration classification model to obtain a configuration text;
carrying out recomposition processing on the music description language text according to the configuration text to obtain a text to be compiled;
and performing music compiling generation processing on the text to be compiled to obtain a compilable music file.
2. The method according to claim 1, wherein the performing music configuration classification processing on the music style description text through a pre-trained configuration classification model to obtain a configuration text comprises:
the configuration classification model comprises a text encoder, a music generator and an audio discriminator;
preprocessing the music style description text through the text encoder to obtain a text hidden vector;
performing music generation processing on the text hidden vector through the music generator to obtain music fragments;
and carrying out audio discrimination processing on the music fragments through the audio discriminator to obtain a configuration text.
3. The method according to claim 1, wherein before the music configuration classification processing is performed on the music style description text through the pre-trained configuration classification model to obtain a configuration text, training the configuration classification model comprises:
acquiring a training text;
inputting the training text into an initialized configuration classification model to obtain a classification result;
and freezing the weight of the music generator in the configuration classification model, and optimizing a text encoder in the configuration classification model according to the classification result to obtain the trained configuration classification model.
4. The method according to claim 1, wherein the adapting the music description language text according to the configuration text to obtain a text to be compiled comprises:
carrying out configuration analysis processing on the configuration text to obtain a text configuration item;
and updating the music description language text according to the text configuration item to obtain a text to be compiled.
5. The method according to claim 1, wherein the performing music compilation generation processing on the text to be compiled to obtain a compilable music file comprises:
performing lexical analysis processing on the text to be compiled to obtain a symbol sequence set;
and carrying out syntactic analysis processing on the symbol sequence set, and executing semantic actions corresponding to the syntactic descriptions obtained by analysis to obtain the compilable music text.
6. The method according to claim 5, wherein the lexical analysis processing is performed on the text to be compiled to obtain a symbol sequence set, and the method comprises:
and performing lexical analysis processing on the text to be compiled, and identifying the symbols of the text to be compiled through a lexical analyzer to obtain a symbol sequence set, wherein the symbol sequence set at least comprises punctuation marks, reserved words, note names, integers and identifiers.
7. The method of claim 5, wherein parsing the set of symbol sequences and performing semantic actions corresponding to the parsed syntactical descriptions to obtain a compilable music text comprises:
performing configuration analysis on the header of the symbol sequence set through a syntax analyzer to obtain a global configuration item;
initializing a Musical Instrument Digital Interface (MIDI) audio track according to the global configuration item;
reducing the remaining part of the symbol sequence set through the syntax analyzer to obtain a music measure;
calculating the timing of the note events of the music measure to obtain a MIDI event;
adding the MIDI event to the MIDI audio track to obtain a compilable music text.
8. An interactive compilation-based music generation apparatus, the apparatus comprising:
the first module is used for acquiring a music description language text and a music style description text;
the second module is used for carrying out music configuration classification processing on the music style description text through a pre-trained configuration classification model to obtain a configuration text;
the third module is used for carrying out recomposition processing on the music description language text according to the configuration text to obtain a text to be compiled;
and the fourth module is used for performing music compiling generation processing on the text to be compiled to obtain a compilable music file.
9. An electronic device, comprising a memory and a processor;
the memory is used for storing programs;
the processor executing the program implements the method of any one of claims 1 to 7.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 7.
CN202211465790.1A 2022-11-22 2022-11-22 Interactive music generation method and device based on compilation Pending CN115712729A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211465790.1A CN115712729A (en) 2022-11-22 2022-11-22 Interactive music generation method and device based on compilation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211465790.1A CN115712729A (en) 2022-11-22 2022-11-22 Interactive music generation method and device based on compilation

Publications (1)

Publication Number Publication Date
CN115712729A true CN115712729A (en) 2023-02-24

Family

ID=85234121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211465790.1A Pending CN115712729A (en) 2022-11-22 2022-11-22 Interactive music generation method and device based on compilation

Country Status (1)

Country Link
CN (1) CN115712729A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116192363A (en) * 2023-04-26 2023-05-30 中新宽维传媒科技有限公司 Audible processing method and device based on text information, medium and computing equipment
CN116192363B (en) * 2023-04-26 2023-07-11 中新宽维传媒科技有限公司 Audible processing method and device based on text information, medium and computing equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination