CN116052621A - Music creation auxiliary method based on language model - Google Patents

Music creation auxiliary method based on language model

Info

Publication number
CN116052621A
CN116052621A (Application No. CN202310059481.2A)
Authority
CN
China
Prior art keywords
music
model
training
data
lyrics
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310059481.2A
Other languages
Chinese (zh)
Inventor
张宇 (Zhang Yu)
宋岩奇 (Song Yanqi)
杨昕 (Yang Xin)
崔涵 (Cui Han)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202310059481.2A priority Critical patent/CN116052621A/en
Publication of CN116052621A publication Critical patent/CN116052621A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 1/00 Details of electrophonic musical instruments
    • G10H 1/0008 Associated control or indicating means
    • G10H 1/0025 Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/63 Querying
    • G06F 16/635 Filtering based on additional data, e.g. user or group profiles
    • G06F 16/636 Filtering based on additional data, e.g. user or group profiles by using biological or physiological data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/65 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/60 Information retrieval; Database structures therefor; File system structures therefor of audio data
    • G06F 16/68 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/686 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title or artist information, time, location or usage information, user ratings
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/031 Musical analysis, i.e. isolation, extraction or identification of musical elements or musical parameters from a raw acoustic signal or from an encoded audio signal
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10H ELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H 2210/00 Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H 2210/101 Music Composition or musical creation; Tools or processes therefor
    • G10H 2210/125 Medley, i.e. linking parts of different musical pieces in one single piece, e.g. sound collage, DJ mix
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Acoustics & Sound (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Physiology (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Electrophonic Musical Instruments (AREA)

Abstract

The invention relates to a music creation auxiliary method based on a language model, in the technical field of artificial intelligence. The method is implemented on the WEB side, so it can support users' creative enthusiasm anytime and anywhere on various platforms, largely avoiding missed opportunities to record and create caused by software and hardware constraints. At the same time, the system can be attached as a plug-in wherever music design and generation are needed, such as a short-video platform, whereas music recommendation on short-video platforms today is essentially limited to recommending existing music for existing video content.

Description

Music creation auxiliary method based on language model
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a music creation auxiliary method based on a language model.
Background
Music composition and design are often out of reach for the general public, and even professionals who work in music have moments when composition stalls for lack of inspiration. The design and creation of music not only requires a great deal of time and cost but, more importantly, requires long-term accumulation of expertise. This keeps music creation beyond the reach of most people.
Artificial intelligence is developing very rapidly, approaching or even exceeding human performance on many tasks in style recognition, audio processing, and generation. How to use these techniques to benefit everyday life is a question the computer industry needs to consider. There is no shortage of professional music creation software on the market, but most of it merely reduces the complexity of music production and provides simulated instruments to assist creators. The invention starts from automatic music creation and uses the latest advances of artificial intelligence in speech processing and natural language processing to build a model that can intelligently assist a creator in composing music. At the same time, the invention designs and develops a music creation assistance system based on a language model, in the hope of lowering the threshold of music creation for amateurs and providing a starting point of inspiration for professional music producers.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by realizing a personalized, multi-style lyric and music generation algorithm that can generate music sung in the user's own voice from the song provided by the user and its corresponding lyrics. Through multi-style song generation and multi-style lyric generation, the invention provides many different combinations of melodies and lyrics; at the same time, the user only needs to upload a recording of their own voice, and the invention can render the generated songs in that voice according to its timbre.
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention provides a music creation auxiliary method based on a language model, which provides the following technical scheme:
a music composition assistance method based on a language model, the method comprising the steps of:
step 1: extracting features from song data, generating two-dimensional representations at different time scales to capture features of music at different levels, and performing two-dimensional convolution learning with ResNet to classify style labels;
step 2: establishing a pre-training music model, and performing position coding to distinguish different positions in the sequence;
step 3: and performing multi-track collaborative learning to generate a music file.
Preferably, the step 1 specifically includes:
the input song data are divided into three parts, namely notes, audio, and lyrics, and features are extracted from all three inputs for model training; for the note data, the chords and rhythm of the song are identified by manual labeling, and multi-track music is split into a plurality of note sequences by track;
for lyric data, semantic and emotional features are extracted using a Transformer-based pre-trained language model;
for audio data, style features are extracted either by manual labeling or by using MS-SincResNet.
Preferably, when labeling the audio data, one-dimensional convolution learning is performed with SincNet and two-dimensional representations are generated at different time scales, so that features of music at different levels are learned; two-dimensional convolution learning is then performed with ResNet, and finally the style labels are classified.
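For illustration only, a minimal sketch of such an audio style classifier is given below; it is not part of the original disclosure. SincNet's learnable band-pass filters are approximated by an ordinary one-dimensional convolution, the ResNet backbone by a small two-dimensional CNN, and the layer sizes, pooling scales, and number of style labels are assumed values.

import torch
import torch.nn as nn

class AudioStyleClassifier(nn.Module):
    def __init__(self, n_styles: int = 10, scales=(50, 100, 200)):
        super().__init__()
        # 1-D convolution over the raw waveform (stand-in for SincNet's sinc filters).
        self.frontend = nn.Conv1d(1, 64, kernel_size=251, stride=4)
        # Pooling windows of different lengths yield two-dimensional representations
        # at different time scales, stacked as channels.
        self.scales = scales
        # Stand-in for the 2-D ResNet used for style classification.
        self.backbone = nn.Sequential(
            nn.Conv2d(len(scales), 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, n_styles)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, 1, samples) raw audio
        feat = torch.relu(self.frontend(wav))                 # (B, 64, T)
        maps = []
        for s in self.scales:
            pooled = nn.functional.max_pool1d(feat, kernel_size=s, stride=s, ceil_mode=True)
            pooled = nn.functional.adaptive_avg_pool1d(pooled, 128)   # common length per scale
            maps.append(pooled)                               # (B, 64, 128)
        x = torch.stack(maps, dim=1)                          # (B, n_scales, 64, 128)
        x = self.backbone(x).flatten(1)                       # (B, 64)
        return self.head(x)                                   # style logits

logits = AudioStyleClassifier()(torch.randn(2, 1, 16000))     # e.g. two 1-second clips at 16 kHz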
Preferably, the step 2 specifically includes:
step 2.1: a pre-training music model is established using a Transformer encoder-decoder structure; the input data include the chord, rhythm, and style labels of songs together with aligned note and lyric text; the notes at each moment of a song, together with the corresponding lyrics and their pinyin, are treated as one tuple, so the input to the pre-training model is a sequence of tuples, and word embedding is used to represent styles, rhythms, chords, notes, and lyrics as dense vectors;
step 2.2: the existing note, lyric, chord, and rhythm sequences, concatenated with position encodings and style label vectors, are taken as input; the output of the last encoder layer of the N-layer Transformer encoder is fed as part of the features of the decoder; the output is obtained through the N-layer Transformer decoder, and finally a multi-layer perceptron and a softmax activation function produce the output probability distribution, from which the next note, lyric, chord, and rhythm are determined;
step 2.3: sinusoidal absolute position encodings are used; the model trains a position encoding for each bar, and at input time the position encodings are repeated according to the number of bars in the song, so that the model retains the position information of each note within a bar while ignoring, as far as possible, the position information of notes at corresponding positions across different bars.
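For illustration only, the bar-repeated sinusoidal position encoding of step 2.3 could be sketched as follows; the bar length (number of tuples per bar) and the model width are assumed values and not taken from the patent.

import torch

def bar_position_encoding(n_bars: int, bar_len: int, d_model: int) -> torch.Tensor:
    # Sinusoidal encoding for a single bar, tiled over all bars of the song.
    pos = torch.arange(bar_len, dtype=torch.float32).unsqueeze(1)       # (bar_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                # even dimensions
    angle = pos / torch.pow(10000.0, i / d_model)
    pe = torch.zeros(bar_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    # Repeating the same per-bar table means position j of bar 1 and position j of
    # bar 2 receive identical encodings, which is the behaviour described above.
    return pe.repeat(n_bars, 1)                                          # (n_bars*bar_len, d_model)

pe = bar_position_encoding(n_bars=8, bar_len=16, d_model=512)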
Preferably, the pre-training language model is trained with MLM and NSP tasks, where the NSP task generates the next tuple from the existing tuple sequence, and the MLM task randomly deletes some tuples, or some element within a tuple, from the sequence so that the model predicts the deleted data.
Preferably, the step 3 specifically includes:
a pre-training music model is trained independently for each instrument, and the self-attention mechanism is modified so that the Transformer layers of the different instrument models are coupled with each other, forming a larger multi-instrument pre-training model; the training objective is the sum of the loss functions of all the models;
the relevant formulas of the self-attention mechanism before modification, and the modified formulas, are given as formula images in the original publication (images BDA0004061012490000041 to BDA0004061012490000046), where the two learnable matrices shown are associated with the k-th instrument and introduce information from the other instruments into self-attention; x_{j,k} denotes the input data of the k-th instrument, and x_j denotes the input data of the current instrument.
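Since those formula images are not reproduced here, the following LaTeX sketch gives one plausible reading of the modification, assuming the standard query and key projections of scaled dot-product attention as the starting point; the symbols W^{Q}, W^{K}, W^{Q}_{k}, W^{K}_{k} and the additive form of the coupling are editorial assumptions for illustration, not the patent's exact formulas.

Before modification (standard self-attention, current instrument only):

    Q_j = x_j W^{Q}, \qquad K_j = x_j W^{K}

One assumed form of the multi-instrument coupling:

    Q_j = x_j W^{Q} + \sum_{k \neq \mathrm{cur}} x_{j,k} W^{Q}_{k}, \qquad
    K_j = x_j W^{K} + \sum_{k \neq \mathrm{cur}} x_{j,k} W^{K}_{k}

where W^{Q}_{k} and W^{K}_{k} are the learnable matrices associated with the k-th instrument that inject the other instruments' information into the attention computation.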
Preferably, the step 5 specifically includes:
when the user inputs chord, rhythm, and phonetic (pinyin) sequences, the model outputs a lyric sequence; when the user inputs chord, rhythm, and lyric sequences, the model outputs a note sequence; and when the user inputs continuous or fragmentary song information, the model completes the full song content.
A language model based music composition assistance system, the system comprising:
the feature extraction module is used for extracting features from song data, generating two-dimensional representations at different time scales to capture features of music at different levels, and performing two-dimensional convolution learning with ResNet to classify style labels;
the model training module is used for establishing a pre-training music model, carrying out position coding and distinguishing different positions in the sequence;
and the music file generation module is used for carrying out multi-track collaborative learning to generate a music file.
A computer-readable storage medium having stored thereon a computer program for execution by a processor for implementing a language model based music composition assistance method.
A computer device comprising a memory storing a computer program and a processor implementing a language model based music composition assistance method when executing the computer program.
The invention has the following beneficial effects:
compared with the prior art, the invention has the advantages that:
the invention is realized based on the WEB terminal, can provide assistance for the creation enthusiasm of users at any time and any place on various platforms, and greatly avoids missing the opportunity of recording and creation due to the problems of software and hardware. Meanwhile, the system can be used as a plug-in to be connected to any place needing music design and generation, such as a short video platform, and the invention realizes that the music recommendation of the short video platform is basically the recommendation of the existing music and video content at present.
The system of the invention is developed to assist music creators in completing their works, and its core function is to create music for different musical styles and genres. Music creation is divided into two parts, lyric writing and melody writing; the notes of a piece are treated like words in natural language, and the creation of lyrics and melody is treated as a text generation task in natural language processing, so music creation can be realized with deep-learning-based natural language processing techniques.
The system provides users with a one-stop creation assistance function, integrating features such as writing lyrics for a melody, composing a melody for lyrics, and continuing a piece from its preceding music. These can all be summarized as sequence generation tasks with different input and output forms, which advanced pre-trained models in natural language processing can clearly handle well. However, unlike natural language, each note in a musical piece carries rich features such as pitch, loudness, and duration, and the piece as a whole has clearly periodic structural features such as rhythm, melody, and chords; how to extract these features well is a key point in designing the model of the present invention. In addition, since a piece of music is usually played by multiple instruments, how to make different instruments cooperate is also an important issue in model design.
The present invention aims to incorporate the styles of different musical genres into the generated music, so recognition of style and genre is a precondition for every part of the system; the style is either identified automatically by the system or specified manually by the user, and is fed into each part of the system as part of the input for music creation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.
FIG. 1 is a diagram of an MS-SincResNet network architecture;
FIG. 2 is a diagram of aligned note and lyrics data;
FIG. 3 is a diagram of a pre-trained musical composition model;
FIG. 4 is a position-coding diagram of a model;
FIG. 5 is a flow chart;
FIG. 6 is a front end interface of one embodiment.
Detailed Description
The following describes the embodiments of the present invention clearly and completely with reference to the accompanying drawings, in which some, but not all, embodiments of the invention are shown. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to be within the scope of the invention.
In the description of the present invention, it should be noted that the directions or positional relationships indicated by the terms "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", "outer", etc. are based on the directions or positional relationships shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be configured and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
In the description of the present invention, it should be noted that, unless explicitly specified and limited otherwise, the terms "mounted," "connected," and "connected" are to be construed broadly, and may be either fixedly connected, detachably connected, or integrally connected, for example; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
In addition, the technical features of the different embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The present invention will be described in detail with reference to specific examples.
First embodiment:
according to the embodiments shown in fig. 1 to 6, the specific optimization technical scheme adopted by the present invention to solve the above technical problems is as follows: the invention relates to a music creation auxiliary method based on a language model.
A music composition assistance method based on a language model, the method comprising the steps of:
step 1: extracting features from song data, generating two-dimensional representations at different time scales to capture features of music at different levels, and performing two-dimensional convolution learning with ResNet to classify style labels;
step 2: establishing a pre-training music model, and performing position coding to distinguish different positions in the sequence;
step 3: and performing multi-track collaborative learning to generate a music file.
The invention has the following advantages: the front-end interface is implemented on the Web side and can be adapted to and made compatible with different kinds of electronic devices; a pre-training music model trained on large-scale music data is used, so the generated music is richer in variety and closer in style to human-composed music.
Compared with most current music generation models, the method achieves multi-track collaborative training by modifying the self-attention mechanism, so that the generated instrument parts are better coordinated with one another.
By extending the position encoding, the model can generate more periodic music.
Compared with most current music generation models, the invention incorporates chords, melody, rhythm, and style into the model's input data, so that the generated music is more melodic.
Specific embodiment II:
the second embodiment of the present application differs from the first embodiment only in that:
the step 1 specifically comprises the following steps:
the input song data are divided into three parts, namely notes, audio, and lyrics, and features are extracted from all three inputs for model training; for the note data, the chords and rhythm of the song are identified by manual labeling, and multi-track music is split into a plurality of note sequences by track;
for lyric data, semantic and emotional features are extracted using a Transformer-based pre-trained language model;
for audio data, style features are extracted either by manual labeling or by using MS-SincResNet.
Third embodiment:
the difference between the third embodiment and the second embodiment of the present application is only that:
when labeling the audio data, SincNet is used for one-dimensional convolution learning and two-dimensional representations are generated at different time scales, so that features of music at different levels are learned; ResNet is then used for two-dimensional convolution learning, and finally the style labels are classified.
Fourth embodiment:
the fourth embodiment of the present application differs from the third embodiment only in that:
the step 2 specifically comprises the following steps:
step 2.1: a pre-training music model is established using a Transformer encoder-decoder structure; the input data include the chord, rhythm, and style labels of songs together with aligned note and lyric text; the notes at each moment of a song, together with the corresponding lyrics and their pinyin, are treated as one tuple, so the input to the pre-training model is a sequence of tuples, and word embedding is used to represent styles, rhythms, chords, notes, and lyrics as dense vectors;
step 2.2: the existing note, lyric, chord, and rhythm sequences, concatenated with position encodings and style label vectors, are taken as input; the output of the last encoder layer of the N-layer Transformer encoder is fed as part of the features of the decoder; the output is obtained through the N-layer Transformer decoder, and finally a multi-layer perceptron and a softmax activation function produce the output probability distribution, from which the next note, lyric, chord, and rhythm are determined;
step 2.3: sinusoidal absolute position encodings are used; the model trains a position encoding for each bar, and at input time the position encodings are repeated according to the number of bars in the song, so that the model retains the position information of each note within a bar while ignoring, as far as possible, the position information of notes at corresponding positions across different bars.
Fifth embodiment:
the fifth embodiment differs from the fourth embodiment only in that:
the pre-training language model is trained with MLM and NSP tasks, where the NSP task generates the next tuple from the existing tuple sequence, and the MLM task randomly deletes some tuples, or some element within a tuple, from the sequence so that the model predicts the deleted data.
Specific embodiment six:
the difference between the sixth embodiment and the fifth embodiment of the present application is only that:
the step 3 specifically comprises the following steps:
a pre-training music model is trained independently for each instrument, and the self-attention mechanism is modified so that the Transformer layers of the different instrument models are coupled with each other, forming a larger multi-instrument pre-training model; the training objective is the sum of the loss functions of all the models;
the relevant formulas of the self-attention mechanism before modification, and the modified formulas, are given as formula images in the original publication (images BDA0004061012490000101 to BDA0004061012490000106), where the two learnable matrices shown are associated with the k-th instrument and introduce information from the other instruments into self-attention; x_{j,k} denotes the input data of the k-th instrument, and x_j denotes the input data of the current instrument.
Specific embodiment seven:
the seventh embodiment of the present application differs from the sixth embodiment only in that:
when the user inputs chord, rhythm, and phonetic (pinyin) sequences, the model outputs a lyric sequence; when the user inputs chord, rhythm, and lyric sequences, the model outputs a note sequence; and when the user inputs continuous or fragmentary song information, the model completes the full song content.
Specific embodiment eight:
the eighth embodiment of the present application differs from the seventh embodiment only in that:
the invention provides a music creation auxiliary system based on a language model, which comprises:
the feature extraction module is used for extracting features from song data, generating two-dimensional representations at different time scales to capture features of music at different levels, and performing two-dimensional convolution learning with ResNet to classify style labels;
the model training module is used for establishing a pre-training music model, carrying out position coding and distinguishing different positions in the sequence;
and the music file generation module is used for carrying out multi-track collaborative learning to generate a music file.
Specific embodiment nine:
embodiment nine of the present application differs from embodiment eight only in that:
the present invention provides a computer-readable storage medium having stored thereon a computer program that is executed by a processor to implement the language model based music composition assistance method as described above.
Specific embodiment ten:
the tenth embodiment differs from the ninth embodiment only in that:
the invention provides a computer device, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes a music creation assisting method based on a language model when executing the computer program.
Specific example eleven:
embodiment eleven of the present application differs from embodiment ten only in that:
the invention realizes a personalized, multi-style lyric and music generation algorithm that can generate music sung in the user's own voice from the lyrics provided by the user and the corresponding song. Through multi-style song generation and multi-style lyric generation, the invention provides many different combinations of melodies and lyrics; at the same time, the user only needs to upload a recording of their own voice, and the invention can render the generated songs in that voice according to its timbre.
The invention is implemented on the WEB side and can support users' creative enthusiasm anytime and anywhere on various platforms, largely avoiding missed opportunities to record and create caused by software and hardware constraints. At the same time, the system can be attached as a plug-in wherever music design and generation are needed, such as a short-video platform, whereas music recommendation on short-video platforms today is essentially limited to recommending existing music for existing video content.
The system of the invention is developed to assist music creators in completing their works, and its core function is to create music for different musical styles and genres. Music creation is divided into two parts, lyric writing and melody writing; the notes of a piece are treated like words in natural language, and the creation of lyrics and melody is treated as a text generation task in natural language processing, so music creation can be realized with deep-learning-based natural language processing techniques.
The system provides users with a one-stop creation assistance function, integrating features such as writing lyrics for a melody, composing a melody for lyrics, and continuing a piece from its preceding music. These can all be summarized as sequence generation tasks with different input and output forms, which advanced pre-trained models in natural language processing can clearly handle well. However, unlike natural language, each note in a musical piece carries rich features such as pitch, loudness, and duration, and the piece as a whole has clearly periodic structural features such as rhythm, melody, and chords; how to extract these features well is a key point in designing the model of the present invention. In addition, since a piece of music is usually played by multiple instruments, how to make different instruments cooperate is also an important issue in model design.
The present invention aims to incorporate the styles of different musical genres into the generated music, so recognition of style and genre is a precondition for every part of the system; the style is either identified automatically by the system or specified manually by the user, and is fed into each part of the system as part of the input for music creation.
The invention first needs to perform feature extraction. The model's input data is divided into three parts, namely notes, audio, and lyrics, so for each song the invention must extract features from these three inputs for model training. For the note data, the chords and rhythm of the song are identified by manual labeling, and multi-track music is split into a plurality of note sequences by track; for lyric data, the invention uses a Transformer-based pre-trained language model to extract semantic and emotional features; for audio data, the invention extracts style features either by manual labeling or with MS-SincResNet. This architecture first uses SincNet for one-dimensional convolution learning and generates two-dimensional representations at different time scales, thereby learning features of music at different levels, then uses ResNet for two-dimensional convolution learning, and finally classifies the style labels. The architecture of MS-SincResNet is shown in FIG. 1.
secondly, the invention needs to build a pre-training music model;
the downstream tasks of the present invention are accomplished by a pre-training model based on song data, as described below in terms of pre-training tasks, pre-training model design, and position-coding design.
Performing a pre-training task:
the pre-training music model adopts a codec structure of a transducer, input data are chord, rhythm and style labels of songs and aligned notes and lyrics texts, as shown in fig. 2, the notes at each moment of the songs and the corresponding lyrics and pinyin are regarded as a tuple, then the input of the pre-training model is a tuple sequence, and the styles, the rhythms, the chords, the notes and the lyrics are expressed by using dense vectors by adopting a word embedding technology. The invention uses MLM and NSP tasks which are common in the training of similar pre-training language models to train the pre-training model, and NSP tasks are used for generating the next tuple through the existing tuple sequence; the MLM task is to randomly delete some tuples or some element in the tuple in the sequence, and the other model predicts the deleted data.
The invention designs a pre-training model:
the model structure is a classical multi-layer Transformer encoder-decoder architecture, as shown in fig. 3:
the model takes the existing note, lyric, chord, and rhythm sequences, concatenated with position encodings and style label vectors, as input; after the N-layer Transformer encoder, the output of the last encoder layer is fed as part of the decoder's features; the output is obtained through the N-layer Transformer decoder, and finally a multi-layer perceptron and a softmax activation function produce the output probability distribution, from which the next note, lyric, chord, and rhythm are determined.
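As an illustrative sketch only, the final step (multi-layer perceptron heads, softmax, and selection of the next element) could look like the following; the vocabulary sizes, the temperature parameter, and the choice of sampling rather than taking the argmax are assumptions, not details of the patent.

import torch
import torch.nn as nn
import torch.nn.functional as F

def predict_next(decoder_state, heads, temperature=1.0):
    # decoder_state: (d_model,) hidden state of the decoder at the current step.
    # heads: dict mapping an attribute name to an nn.Linear projecting d_model -> vocab size.
    out = {}
    for name, head in heads.items():
        probs = F.softmax(head(decoder_state) / temperature, dim=-1)
        out[name] = torch.multinomial(probs, num_samples=1).item()
    return out  # indices of the next note, lyric, chord and rhythm

heads = {"note": nn.Linear(512, 128), "lyric": nn.Linear(512, 5000),
         "chord": nn.Linear(512, 64), "rhythm": nn.Linear(512, 32)}
nxt = predict_next(torch.randn(512), heads)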
Position encoding setting:
to address the permutation invariance of the self-attention mechanism, the input of a pre-trained model usually contains position encodings so that the model can distinguish different positions in the sequence, and the same element therefore carries different meanings at different positions. Considering the characteristics of music data (a song is generally split into several bars according to its chords and rhythm, and the bars do not differ greatly from one another), the invention uses sinusoidal absolute position encodings. The model trains a position encoding for each bar and, at input time, repeats it according to the number of bars in the song, as shown in fig. 4, so that the model retains the position information of each note within a bar while ignoring, as far as possible, the position information of notes at corresponding positions across different bars.
Further, the present invention performs multi-track collaborative learning:
in order to enable the model to learn the coordination between different instruments and different tracks, the invention trains a pre-training music model separately for each instrument and modifies the self-attention mechanism so that the Transformer layers of the different instrument models are coupled with each other, forming a larger multi-instrument pre-training model whose training objective is the sum of the loss functions of all the models.
The relevant formulas of the original self-attention mechanism, and the formulas as modified by the invention, are given as formula images in the original publication (images BDA0004061012490000151 to BDA0004061012490000156), where the two learnable matrices shown are associated with the k-th instrument and introduce information from the other instruments into self-attention. x_{j,k} denotes the input data of the k-th instrument; x_j denotes the input data of the current instrument.
The tasks of the music creation assistance system are basically the same as the pre-training tasks of the pre-training music model. The user inputs a partly created music file into the system, and the system generates the remaining content with the trained pre-training model: for example, the user inputs chord, rhythm, and pinyin sequences and the model outputs a lyric sequence; or the user inputs chord, rhythm, and lyric sequences and the model outputs a note sequence; the user may also input continuous or fragmentary song information, and the model completes the full song content. These tasks all correspond to the MLM and NSP tasks of the pre-training stage, so the downstream tasks require hardly any change to the model logic and no additional new model.
The system first reads the required features from the user's input, then feeds them into the multi-track pre-training music model to obtain the model's output, and finally processes the output into the corresponding music file according to the user's requirements.
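Put end to end, the flow described in the previous paragraph might look like the sketch below; the helper functions and the stub model are editorial stand-ins for the real system components and are not named by the patent.

def extract_features(path):
    # assumed helper: extract note / audio / lyric features from the user's upload
    return {"notes": [], "lyrics": [], "audio": path}

class StubModel:
    # stands in for the trained multi-track pre-training music model
    def complete(self, features):
        return {"tracks": {"piano": ["C4", "E4", "G4"]}, "lyrics": ["..."]}

def postprocess(generated, out_path):
    # assumed helper: write the generated content to a music file
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(str(generated))

def assist(model, upload_path, style, out_path="completed.txt"):
    features = extract_features(upload_path)
    features["style"] = style                 # chosen by the user or detected by the system
    generated = model.complete(features)      # the model fills in the missing musical content
    postprocess(generated, out_path)
    return out_path

assist(StubModel(), "demo_input.mid", style="pop")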
The system can be realized with a front-end interface similar to the one shown in fig. 6, with the model deployed on a back-end server that the front end calls on request. The user can press the Export button to import an audio file, or enter music manually with the electronic music devices provided at the bottom of the page; after a successful import, a simple music playback and display interface appears in the middle of the page. The user can then select one of the styles on the right side of the page and press the Start button, after which the model completes the music imported by the user, automatically processes it into a new music file, and prompts the user to download it.
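One way the front-end-to-back-end call described above could be wired up is sketched below, assuming a Flask back end; the route name, the form field names, and the reuse of the assist() and StubModel sketch from the previous example are assumptions for illustration only.

from flask import Flask, request, send_file

app = Flask(__name__)

@app.route("/complete", methods=["POST"])
def complete():
    upload = request.files["audio"]                  # the music imported by the user
    style = request.form.get("style", "pop")         # the style selected on the page
    upload.save("input.mid")
    out_path = assist(StubModel(), "input.mid", style)   # see the assist() sketch above
    return send_file(out_path, as_attachment=True)   # the browser prompts the user to download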
In the description of this specification, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples, and the different embodiments or examples described in this specification, and their features, may be combined by those skilled in the art without contradiction.

Furthermore, the terms "first," "second," and the like are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated; thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "a plurality" means at least two, for example two or three, unless specifically defined otherwise.

Any process or method description in a flowchart, or otherwise described herein, may be understood as representing a module, segment, or portion of code that includes one or more executable instructions for implementing specific logical functions or steps of the process; and the scope of the preferred embodiments of the present invention includes further implementations in which functions may be executed out of the order shown or discussed, including substantially concurrently or in reverse order depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.

Logic and/or steps represented in the flowcharts or otherwise described herein, for example an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by, or in connection with, an instruction execution system, apparatus, or device, such as a computer-based system, a processor-containing system, or another system that can fetch the instructions from the instruction execution system, apparatus, or device and execute them. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium include: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM).

In addition, the computer-readable medium may even be paper or another suitable medium on which the program is printed, since the program can be captured electronically, for instance by optical scanning of the paper or other medium, and then compiled, interpreted, or otherwise processed in a suitable manner if necessary and stored in a computer memory. It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, multiple steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. If implemented in hardware, as in another embodiment, they may be implemented using any one, or a combination, of the following techniques well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application-specific integrated circuits having suitable combinational logic gates, programmable gate arrays (PGAs), field-programmable gate arrays (FPGAs), and the like.
The above description is only a preferred implementation of the music creation assistance method based on the language model, and the protection scope of the music creation assistance method based on the language model is not limited to the above embodiments; all technical solutions under this concept belong to the protection scope of the present invention. It should be noted that modifications and variations can be made by those skilled in the art without departing from the principles of the present invention, and such modifications and variations are also considered to be within the scope of the present invention.

Claims (10)

1. A music creation auxiliary method based on a language model is characterized in that: the method comprises the following steps:
step 1: extracting features from song data, generating two-dimensional representations at different time scales to capture features of music at different levels, and performing two-dimensional convolution learning with ResNet to classify style labels;
step 2: establishing a pre-training music model, and performing position coding to distinguish different positions in the sequence;
step 3: and performing multi-track collaborative learning to generate a music file.
2. The method according to claim 1, characterized in that: the step 1 specifically comprises the following steps:
the input song data are divided into three parts, namely notes, audio, and lyrics, and features are extracted from all three inputs for model training; for the note data, the chords and rhythm of the song are identified by manual labeling, and multi-track music is split into a plurality of note sequences by track;
for lyric data, semantic and emotional features are extracted using a Transformer-based pre-trained language model;
for audio data, style features are extracted either by manual labeling or by using MS-SincResNet.
3. The method according to claim 2, characterized in that: when labeling the audio data, SincNet is used for one-dimensional convolution learning and two-dimensional representations are generated at different time scales, so that features of music at different levels are learned; ResNet is then used for two-dimensional convolution learning, and finally the style labels are classified.
4. A method according to claim 3, characterized in that: the step 2 specifically comprises the following steps:
step 2.1: a pre-training music model is established using a Transformer encoder-decoder structure; the input data include the chord, rhythm, and style labels of songs together with aligned note and lyric text; the notes at each moment of a song, together with the corresponding lyrics and their pinyin, are treated as one tuple, so the input to the pre-training model is a sequence of tuples, and word embedding is used to represent styles, rhythms, chords, notes, and lyrics as dense vectors;
step 2.2: the existing note, lyric, chord, and rhythm sequences, concatenated with position encodings and style label vectors, are taken as input; the output of the last encoder layer of the N-layer Transformer encoder is fed as part of the features of the decoder; the output is obtained through the N-layer Transformer decoder, and finally a multi-layer perceptron and a softmax activation function produce the output probability distribution, from which the next note, lyric, chord, and rhythm are determined;
step 2.3: sinusoidal absolute position encodings are used; the model trains a position encoding for each bar, and at input time the position encodings are repeated according to the number of bars in the song, so that the model retains the position information of each note within a bar while ignoring, as far as possible, the position information of notes at corresponding positions across different bars.
5. The method according to claim 4, characterized in that:
the pre-training language model is trained with MLM and NSP tasks, where the NSP task generates the next tuple from the existing tuple sequence, and the MLM task randomly deletes some tuples, or some element within a tuple, from the sequence so that the model predicts the deleted data.
6. The method according to claim 5, characterized in that: the step 3 specifically comprises the following steps:
a pre-training music model is trained independently for each instrument, and the self-attention mechanism is modified so that the Transformer layers of the different instrument models are coupled with each other, forming a larger multi-instrument pre-training model; the training objective is the sum of the loss functions of all the models;
the relevant formulas of the self-attention mechanism before modification, and the modified formulas, are given as formula images in the original claim (images FDA0004061012480000021, FDA0004061012480000022 and FDA0004061012480000031 to FDA0004061012480000034), where the two learnable matrices shown are associated with the k-th instrument and introduce information from the other instruments into self-attention; x_{j,k} denotes the input data of the k-th instrument, and x_j denotes the input data of the current instrument.
7. The method according to claim 6, characterized in that: the step 5 specifically comprises the following steps:
when the user inputs chord, rhythm, and phonetic (pinyin) sequences, the model outputs a lyric sequence; when the user inputs chord, rhythm, and lyric sequences, the model outputs a note sequence; and when the user inputs continuous or fragmentary song information, the model completes the full song content.
8. A music creation auxiliary system based on a language model is characterized in that: the system comprises:
the feature extraction module is used for extracting features from song data, generating two-dimensional representations at different time scales to capture features of music at different levels, and performing two-dimensional convolution learning with ResNet to classify style labels;
the model training module is used for establishing a pre-training music model, carrying out position coding and distinguishing different positions in the sequence;
and the music file generation module is used for carrying out multi-track collaborative learning to generate a music file.
9. A computer readable storage medium having stored thereon a computer program, characterized in that the program is executed by a processor for implementing the method according to any one of claims 1-7.
10. A computer device comprising a memory and a processor, the memory storing a computer program, characterized by: the processor, when executing the computer program, implements the method according to any one of claims 1-7.
CN202310059481.2A 2023-01-17 2023-01-17 Music creation auxiliary method based on language model Pending CN116052621A (en)

Priority Applications (1)

Application Number: CN202310059481.2A
Priority Date: 2023-01-17
Filing Date: 2023-01-17
Title: Music creation auxiliary method based on language model (published as CN116052621A)

Applications Claiming Priority (1)

Application Number: CN202310059481.2A
Priority Date: 2023-01-17
Filing Date: 2023-01-17
Title: Music creation auxiliary method based on language model (published as CN116052621A)

Publications (1)

Publication Number: CN116052621A
Publication Date: 2023-05-02

Family

ID=86113331

Family Applications (1)

Application Number: CN202310059481.2A
Priority Date: 2023-01-17
Filing Date: 2023-01-17
Title: Music creation auxiliary method based on language model (Pending; published as CN116052621A)

Country Status (1)

Country Link
CN (1) CN116052621A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116704980A (en) * 2023-08-03 2023-09-05 Tencent Technology (Shenzhen) Co., Ltd. Musical composition generation method, music generation model training method and equipment thereof
CN116704980B (en) * 2023-08-03 2023-10-20 Tencent Technology (Shenzhen) Co., Ltd. Musical composition generation method, music generation model training method and equipment thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination