CN115881063A - Music generation method, device and storage medium - Google Patents

Music generation method, device and storage medium Download PDF

Info

Publication number
CN115881063A
Authority
CN
China
Prior art keywords
information
target
music
training
rendering
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111115570.1A
Other languages
Chinese (zh)
Inventor
汪蕴哲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Original Assignee
Beijing Xiaomi Mobile Software Co Ltd
Beijing Xiaomi Pinecone Electronic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co Ltd, Beijing Xiaomi Pinecone Electronic Co Ltd filed Critical Beijing Xiaomi Mobile Software Co Ltd
Priority to CN202111115570.1A priority Critical patent/CN115881063A/en
Publication of CN115881063A publication Critical patent/CN115881063A/en
Pending legal-status Critical Current

Landscapes

  • Electrophonic Musical Instruments (AREA)

Abstract

The present disclosure relates to a music generation method, apparatus, and storage medium. The method includes: determining chord information and rhythm type information corresponding to music preference information according to the music preference information input by a user; generating target MIDI data according to the chord information and the rhythm type information; determining a target rendering rule, where the target rendering rule includes a tone combination and a rendering order for performing tone rendering on MIDI data; and performing tone rendering on the target MIDI data according to the target rendering rule to obtain target music. In this way, MIDI data that better matches the user's music preference can be generated based on the chord information and rhythm type information corresponding to that preference, and the MIDI data is then rendered according to the tone combination and rendering order, so that the resulting target music has richer musical information and a more complete musical structure and can meet the demands that a DJ (disc-playing) scenario places on music.

Description

Music generation method, device and storage medium
Technical Field
The present disclosure relates to the field of music synthesis, and in particular, to a music generation method, apparatus, and storage medium.
Background
At present, in the field of music generation, music synthesis is mainly achieved with rule-based techniques, probabilistic models, and deep learning. In recent years, deep-learning-based music generation has been favored because the first two approaches have greater inherent limitations. However, deep-learning-based music generation currently focuses mainly on piano pieces, instrumental melodies, popular compositions, and the like; the generated music is relatively simple, so music with a more complex structure, such as the electronic music used in DJ (disc-playing) scenarios, cannot be generated.
Disclosure of Invention
To overcome the problems in the related art, the present disclosure provides a music generation method, apparatus, and storage medium.
According to a first aspect of embodiments of the present disclosure, there is provided a music generating method, the method including:
determining chord information and rhythm type information corresponding to the music preference information according to the music preference information input by a user;
generating target MIDI data according to the chord information and the rhythm type information;
determining a target rendering rule, wherein the target rendering rule comprises a tone combination and a rendering sequence for tone rendering MIDI data;
and performing tone rendering on the target MIDI data according to the target rendering rule to obtain target music.
Optionally, the generating target MIDI data according to the chord information and the rhythm type information includes:
inputting the chord information and the rhythm type information into a generating model to obtain note distribution information of a preset audio track output by the generating model, wherein the generating model is obtained by training a generative adversarial network;
and determining the target MIDI data according to the note distribution information.
Optionally, the generation model includes at least one generation submodel, and the generation submodel corresponds to the preset audio track one to one;
the inputting the chord information and the rhythm type information into a generation model to obtain note distribution information of a preset audio track output by the generation model includes:
acquiring noise information;
and respectively inputting the chord information, the rhythm type information and the noise information into each generation submodel to obtain the note distribution information output by each generation submodel so as to obtain the note distribution information of each preset audio track.
Optionally, the generative model is obtained by:
acquiring first training data, wherein the first training data comprises chord samples and rhythm samples;
determining a training model, wherein the training model comprises a generation network and a discriminator which respectively correspond to each preset audio track;
respectively inputting a target chord sample, a target rhythm sample and the obtained target noise into each generation network used by the training to obtain a first output result output by each generation network, wherein the target chord sample and the target rhythm sample are taken from the chord sample and the rhythm sample in the first training data;
for each preset audio track, inputting a first output result, the target chord sample and the target rhythm type sample corresponding to the preset audio track into a discriminator corresponding to the preset audio track to obtain a second output result output by the discriminator;
under the condition that the training stopping condition is not met, updating each generation network used in the training according to the second output result to obtain an updated training model, and using the updated training model for the next training;
and under the condition that the training stopping condition is met, respectively using the generation networks used in the training as generation submodels to obtain a generation model consisting of the generation submodels.
Optionally, the training model further comprises a global discriminant network;
the updating each generated network used in the training according to the second output result includes:
determining a first loss value of a generating network corresponding to each preset audio track according to a second output result corresponding to each preset audio track;
generating an input tensor according to a first output result, the target chord sample and the target rhythm type sample corresponding to each preset audio track;
inputting the input tensor to the global discriminant network to obtain a third output result output by the global discriminant network;
determining a second loss value according to the third output result;
and updating the network parameters of each generation network used by the training according to the first loss value and the second loss value.
Optionally, each of the generating submodels includes a first generator for generating note onset point distribution information and a second generator for generating note duration information.
Optionally, the note distribution information includes note start point distribution information and sustained note position information;
said determining said target MIDI data according to said note distribution information, comprising:
combining the note starting position information and the continuous note position information to obtain a combined result;
and denoising the merged result to obtain the target MIDI data.
Optionally, the determining a target rendering rule includes:
and determining a rendering rule corresponding to the music preference information input by the user as the target rendering rule according to the corresponding relation between the preset music preference information and the rendering rule.
Optionally, the determining a target rendering rule includes:
and inputting the music preference information into a pre-trained rule generating model to obtain an output result of the rule generating model as the target rendering rule, wherein the rule generating model is obtained by training a neural network model by using second training data, and the second training data comprises a plurality of groups of music preference information samples and rendering rule samples.
According to a second aspect of embodiments of the present disclosure, there is provided a music generating apparatus, the apparatus including:
a first determining module configured to determine chord information and rhythm type information corresponding to music preference information according to the music preference information input by a user;
a generating module configured to generate target MIDI data according to the chord information and the rhythm type information;
a second determination module configured to determine a target rendering rule including a tone combination and a rendering order for tone rendering MIDI data;
and the rendering module is configured to perform tone rendering on the target MIDI data according to the target rendering rule to obtain target music.
According to a third aspect of the embodiments of the present disclosure, there is provided a music generating apparatus including:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
determining chord information and rhythm type information corresponding to the music preference information according to the music preference information input by a user;
generating target MIDI data according to the chord information and the rhythm type information;
determining a target rendering rule, wherein the target rendering rule comprises a tone combination and a rendering sequence for tone rendering MIDI data;
and performing tone rendering on the target MIDI data according to the target rendering rule to obtain target music.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the music generation method provided by the first aspect of the present disclosure.
The technical scheme provided by the embodiment of the disclosure can have the following beneficial effects:
according to the technical scheme, the chord information and the rhythm type information corresponding to the music preference information are determined according to the music preference information input by the user, then the target MIDI data are generated according to the chord information and the rhythm type information, and the target rendering rule is determined, wherein the target rendering rule comprises the tone combination and the rendering sequence for performing tone rendering on the MIDI data, and the target MIDI data are subjected to tone rendering according to the target rendering rule to obtain the target music. Therefore, based on the preference of the user to the music, the chord information and the rhythm information corresponding to the user are determined, the MIDI data corresponding to the chord information and the rhythm information are regenerated, and then the MIDI data are rendered according to the determined tone combination and rendering sequence, so as to generate the final target music. Therefore, based on chord information and rhythm information corresponding to the music preference of the user, MIDI data more conforming to the music preference of the user can be generated, and then the MIDI data is rendered according to the tone combination and the sequence, so that finally obtained target music has richer music information and more complete music structure, and further, the requirements of a disc playing scene on the music can be met.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a flowchart illustrating a music generation method according to an exemplary embodiment.
Fig. 2 is an exemplary flowchart of the step of generating target MIDI data from chord information and rhythm type information in the music generation method provided according to an exemplary embodiment.
FIG. 3 is an exemplary flowchart for obtaining the generation model in a music generation method according to an exemplary embodiment.
Fig. 4 is a block diagram illustrating a music generating apparatus according to an exemplary embodiment.
Fig. 5 is a block diagram illustrating a music generating apparatus according to an exemplary embodiment.
Fig. 6 is a block diagram illustrating a music generating apparatus according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. The following description refers to the accompanying drawings in which the same numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Before describing the method provided by the present disclosure, the application scenarios involved in the embodiments of the present disclosure are described first. The application scenario may be any scenario in which music needs to be generated automatically for a user in response to a demand raised by the user (i.e., the user's music preference). For example, an electronic device may generate corresponding music for the user based on a selection the user makes from provided music preference options, or based on a music preference actively input by the user (e.g., by text input or voice input). Taking actively input music preferences as an example, the user may invoke the voice assistant function of the electronic device and input, by voice, music preferences such as a favorite music style and music scene, and the electronic device then generates music conforming to those preferences for the user, thereby providing suitable music. The electronic device may be a terminal or a back-end server. For example, if the electronic device is a terminal, the terminal may be a mobile terminal such as a smart phone, a tablet computer, a smart watch, a smart bracelet, or a PDA (Personal Digital Assistant), or a fixed terminal such as a desktop computer.
Fig. 1 is a flowchart illustrating a music generation method according to an exemplary embodiment. As described in the foregoing, the method provided by the present disclosure may be applied to an electronic device, for example, a terminal or a server. As shown in fig. 1, the method may include steps 11 to 14.
In step 11, chord information and rhythm type information corresponding to the music preference information are determined according to the music preference information input by the user.
The music preference information may include, among other things, a music style and a music scene. The music style may be, for example, soothing, intense, and the like. The music scene may be, for example, travel, lunch break, sports, and the like.
In a possible implementation, correspondences between music preference information and chord information and rhythm type information may be preset, so that the chord information and rhythm type information corresponding to the music preference information input by the user can be determined by looking them up in these correspondences.
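As a purely illustrative sketch of such a preset correspondence (the style, scene, chord, and rhythm-type labels below are hypothetical placeholders, not values given in this disclosure), the mapping could be held in a simple table:

```python
# Hypothetical lookup table for the preset-correspondence approach.
# Keys and values are illustrative placeholders, not from this disclosure.
PREFERENCE_TABLE = {
    ("soothing", "lunch break"): {
        "chords": ["Cmaj7", "Am7", "Fmaj7", "G7"],   # chord progression label
        "rhythm_type": "half_time_8beat",            # rhythm-pattern label
    },
    ("intense", "sports"): {
        "chords": ["Em", "C", "G", "D"],
        "rhythm_type": "four_on_the_floor_16beat",
    },
}

def lookup_preference(style: str, scene: str) -> dict:
    """Return chord and rhythm-type labels for a (style, scene) preference."""
    return PREFERENCE_TABLE[(style, scene)]
```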
In another possible implementation, a plurality of pieces of chord information and rhythm type information labeled with music preference information may be collected as training data, and a neural network model may be trained on them to obtain a model capable of generating corresponding chord information and rhythm type information from input music preference information. In actual use, the music preference information input by the user is fed into this model, and the model's output is the chord information and rhythm type information corresponding to that preference.
Both the chord information and the rhythm type information may be matrices in which the number of rows is the number of time steps in the music and the number of columns is the number of pitches. For example, the chord information and the rhythm type information may each be a 32 × 128 matrix, where 32 denotes thirty-second-note time steps and 128 denotes the number of MIDI pitches.
In step 12, target MIDI data is generated based on the chord information and the rhythm type information.
In one possible embodiment, step 12 may include step 21 and step 22, as shown in fig. 2.
In step 21, the chord information and the rhythm type information are input into the generation model to obtain the note distribution information of the preset audio tracks output by the generation model.
The generative model is obtained by training a generative adversarial network. A Generative Adversarial Network (GAN) is a deep learning model and one of the most promising approaches to unsupervised learning on complex distributions in recent years; it produces good outputs through the adversarial (game-theoretic) interplay of (at least) two modules in its framework, a generative model and a discriminative model. In the present disclosure, the chord information and the rhythm type information serve as the condition information of the generative adversarial network during training.
In one possible embodiment, step 21 may comprise the steps of:
acquiring noise information;
and respectively inputting the chord information, the rhythm type information and the noise information into each generation submodel to obtain the note distribution information output by each generation submodel so as to obtain the note distribution information of each preset audio track.
The noise information may be random noise, for example noise randomly sampled from a standard normal distribution. The generation model may include at least one generation submodel, and the generation submodels correspond one-to-one to the preset audio tracks. The MIDI data may be single-track or multi-track MIDI data, where each track of the MIDI data corresponds to one instrument.
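A minimal sketch of this per-track generation step, assuming PyTorch, a hypothetical `submodels` dictionary keyed by track name, and (32, 128) condition tensors as described above:

```python
import torch

def generate_note_distributions(submodels, chord, rhythm, noise_dim=128):
    """Run each per-track generation submodel on the same condition information.

    `submodels` is a hypothetical dict {track_name: generator_module}; `chord`
    and `rhythm` are (32, 128) tensors as described above. The noise is sampled
    from a standard normal distribution, as in this disclosure.
    """
    noise = torch.randn(noise_dim)
    note_distributions = {}
    for track_name, generator in submodels.items():
        with torch.no_grad():
            # Each submodel outputs its track's note distribution information.
            note_distributions[track_name] = generator(chord, rhythm, noise)
    return note_distributions
```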
Illustratively, the generative model may be obtained by steps 31 to 36 as shown in fig. 3.
In step 31, first training data is acquired.
The first training data may include chord samples and rhythm type samples, which may be manually annotated based on existing musical scores. For example, the chord samples and the rhythm type samples may both be matrices in which the number of rows is the number of time steps in the music and the number of columns is the number of pitches, e.g. 32 × 128 matrices where 32 denotes thirty-second-note time steps and 128 denotes the number of MIDI pitches. For a chord, the chord root at each time step may be set to 1 and all other positions set to 0 to form a chord sample. For a rhythm type, an arbitrary pitch at each downbeat time step may be set to 1 and all other positions of the matrix set to 0 to form a rhythm type sample.
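A minimal NumPy sketch of this sample encoding, assuming 32 thirty-second-note time steps, 128 MIDI pitches, and hypothetical per-step annotations (the helper names and example values are illustrative):

```python
import numpy as np

N_STEPS, N_PITCHES = 32, 128  # thirty-second-note time steps x MIDI pitches

def encode_chord(root_pitches):
    """Chord sample: set the chord root pitch to 1 at every time step.

    `root_pitches` is a hypothetical per-step list of MIDI root pitches
    (length N_STEPS), e.g. produced by manual annotation of a score.
    """
    m = np.zeros((N_STEPS, N_PITCHES), dtype=np.float32)
    for step, root in enumerate(root_pitches):
        m[step, root] = 1.0
    return m

def encode_rhythm(downbeat_steps, pitch=60):
    """Rhythm type sample: set an arbitrary pitch to 1 at each downbeat step."""
    m = np.zeros((N_STEPS, N_PITCHES), dtype=np.float32)
    for step in downbeat_steps:
        m[step, pitch] = 1.0
    return m

chord_sample = encode_chord([60] * N_STEPS)     # one bar with C as the root
rhythm_sample = encode_rhythm([0, 8, 16, 24])   # a downbeat every 8 steps
```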
In step 32, a training model is determined.
The training model may include a generation network and a discriminator corresponding to each preset audio track. When determining the training model, its structure needs to be constructed according to the number of tracks in the MIDI data that is ultimately to be generated; that is, each preset audio track corresponds to one generation network and one discriminator.
In step 33, the target chord sample, the target rhythm sample, and the obtained target noise are respectively input to each generation network used in the training, so as to obtain a first output result output by each generation network.
Wherein the target chord sample is taken from a chord sample in the first training data and the target rhythm pattern sample is taken from a rhythm pattern sample in the first training data. The target noise may be noise randomly sampled from a standard normal distribution.
Each generation network may include a first generator for generating note onset point distribution information and a second generator for generating note duration information. Accordingly, the first output result may be information characterizing the distribution of notes. The first output result may consist of a note onset distribution matrix and a sustained note position matrix, whose structure may be consistent with the matrix structure of the chord information and the rhythm type information. The note onset distribution information identifies where a note begins, and the sustained note position information indicates where a note is held. For example, in the matrix corresponding to the note onset distribution information, the starting position of each note is 1 and the other positions are 0; correspondingly, in the matrix corresponding to the sustained note position information, the positions where a note is held are 1 and the other positions are 0.
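One possible way such a per-track generation network could be structured, sketched in PyTorch; the two-head MLP layout and the layer sizes are assumptions, not taken from this disclosure:

```python
import torch
import torch.nn as nn

class TrackGenerator(nn.Module):
    """Per-track generation network: a first generator for note onset
    distribution and a second generator for sustained-note positions.
    The simple MLP structure and layer sizes are illustrative assumptions."""

    def __init__(self, n_steps=32, n_pitches=128, noise_dim=128):
        super().__init__()
        in_dim = 2 * n_steps * n_pitches + noise_dim  # chord + rhythm + noise
        out_dim = n_steps * n_pitches
        self.onset_generator = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim), nn.Sigmoid())
        self.duration_generator = nn.Sequential(
            nn.Linear(in_dim, 512), nn.ReLU(), nn.Linear(512, out_dim), nn.Sigmoid())
        self.n_steps, self.n_pitches = n_steps, n_pitches

    def forward(self, chord, rhythm, noise):
        cond = torch.cat([chord.flatten(), rhythm.flatten(), noise])
        onsets = self.onset_generator(cond).view(self.n_steps, self.n_pitches)
        durations = self.duration_generator(cond).view(self.n_steps, self.n_pitches)
        return onsets, durations
```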
In step 34, for each preset audio track, the first output result, the target chord sample and the target rhythm type sample corresponding to the preset audio track are input to the discriminator corresponding to the preset audio track, so as to obtain a second output result output by the discriminator.
When training the generation networks, discriminators with fixed internal parameters may be used. A discriminator may be trained in advance using the first training data; that is, an untrained generation network is used to produce output results, and those output results together with real MIDI data samples are fed to the discriminator to train its discrimination capability. Once a trained discriminator with a certain discrimination capability is obtained, its internal parameters are fixed for the training of the generation networks.
The first output result consists of two matrices describing the note distribution, and the target chord sample and the target rhythm type sample are two further matrices. Since the four matrices have the same structure, a new matrix (or tensor) formed from them can be input to the corresponding discriminator to obtain that discriminator's second output result. The second output result is the discriminator's score for the first output result, and the score reflects whether the discriminator judges the first output result to be real data. In the present disclosure, the purpose of training the generation networks is to enable them to generate data that is as realistic as possible, so that the discriminator cannot tell whether the output of a generation network is real or fake, i.e., so that the score (the second output result) the discriminator outputs for the first output result is very close to the score for real data.
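A minimal sketch of feeding the four same-shaped matrices to a per-track discriminator, assuming PyTorch and a simple channel-wise stacking; the stacking scheme and the `discriminator` module are assumptions:

```python
import torch

def score_track(discriminator, onsets, durations, chord_sample, rhythm_sample):
    """Stack the first output result (two generated matrices) with the target
    chord and rhythm type samples into one tensor, then let the per-track
    discriminator score it. The 4-channel stacking is an illustrative choice."""
    x = torch.stack([onsets, durations, chord_sample, rhythm_sample])  # (4, 32, 128)
    return discriminator(x)  # second output result: a realism score
```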
In step 35, if the stop training condition is not satisfied, each generation network used in the current training is updated according to the second output result to obtain an updated training model, and the updated training model is used for the next training.
For example, the training-stop condition may be that the training duration reaches a preset duration, that the number of training iterations reaches a preset count, or that the model loss value of the current iteration is lower than a preset threshold.
In a possible implementation, when the training-stop condition is not satisfied, a first loss value of the generation network corresponding to each preset audio track may be determined according to the second output result corresponding to that preset audio track, and the generation networks used in the current training may be updated with the first loss values.
Each second output result is the score output by one discriminator, and the real data also has a corresponding score. For each second output result, the first loss value may be determined from that second output result and the score corresponding to the real data; for example, a cross-entropy loss may be computed from the second output result and the score corresponding to the real data to obtain the first loss value.
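One common way to instantiate such a cross-entropy loss is binary cross-entropy of the discriminator's score against the real-data label; the PyTorch sketch below is an illustrative assumption rather than the exact formulation used here:

```python
import torch
import torch.nn.functional as F

def generator_adversarial_loss(second_output):
    """First loss value for one track's generation network.

    `second_output` is the discriminator's score for the first output result,
    assumed to be a probability in [0, 1]. The generation network is rewarded
    when the score approaches the real-data label 1.0 -- a common
    cross-entropy formulation, used here as an illustrative assumption.
    """
    real_label = torch.ones_like(second_output)
    return F.binary_cross_entropy(second_output, real_label)
```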
In another possible implementation, if the training model includes a plurality of generation networks, the training model may further include a global discriminant network for performing a comprehensive evaluation of the output content of all the generation networks. Accordingly, when the training-stop condition is not satisfied, the following steps may further be included:
generating an input tensor according to a first output result, a target chord sample and a target rhythm type sample corresponding to each preset audio track;
inputting the input tensor into the global discriminant network to obtain a third output result output by the global discriminant network;
determining a second loss value according to the third output result;
and updating the network parameters of each generation network used by the training according to the first loss value and the second loss value.
The global discriminant network is essentially also a discriminator; it evaluates whether the outputs of all the generation networks used in the training can, taken as a whole, pass as real data, so the third output result is the score the global discriminant network outputs for the input tensor. The second loss value is determined from the third output result of the global discriminant network and the score corresponding to the real data; for example, a cross-entropy loss may be computed from the third output result and the score corresponding to the real data to obtain the second loss value.
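A minimal sketch of combining the two kinds of loss values to update the generation networks, assuming PyTorch, a single optimizer over all generation-network parameters, and an equal weighting (all of which are assumptions, not specified here):

```python
import torch

def update_generators(gen_optimizer, first_losses, second_loss, global_weight=1.0):
    """Update all generation networks from the first and second loss values.

    `gen_optimizer` is assumed to be one optimizer over the parameters of every
    generation network; `first_losses` is a dict of per-track first loss values
    and `second_loss` comes from the global discriminant network. The equal
    weighting is an illustrative choice.
    """
    total_loss = sum(first_losses.values()) + global_weight * second_loss
    gen_optimizer.zero_grad()
    total_loss.backward()
    gen_optimizer.step()
    return total_loss.detach()
```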
In this way, when the training-stop condition is not satisfied, the second loss value is computed in addition to the first loss values, and the generation networks are updated by combining the two kinds of loss values, so that well-performing generation networks can be obtained more quickly.
In step 36, when the stop training condition is satisfied, the generation networks used in the present training are respectively used as the generation submodels to obtain the generation model composed of the generation submodels.
Referring to the structure of the generation network, each of the generation submodels may also correspondingly include a first generator for generating note onset point distribution information and a second generator for generating note duration information.
Returning to fig. 2, in step 22, target MIDI data is determined based on the note distribution information.
In one possible embodiment, step 22 may include the steps of:
combining the note starting position information and the continuous note position information to obtain a combined result;
and denoising the merged result to obtain target MIDI data.
The note distribution information may include note onset distribution information and sustained note position information. As described above, these are two matrices with the same structure: in them, the start position of a note is 1, positions where the note is sustained are 1, and all other positions are 0. If the two matrices are added, both the onset and the continuation of each note can be identified, i.e., each note can be located. Therefore, the note onset position information and the sustained note position information can be combined to obtain a combined result.
After the combined result is obtained, denoising may further be performed. The combined result contains three kinds of values: positions that are both a note onset and a note continuation are 2, positions that are only a note continuation are 1, and all other positions remain 0. Thus, if a run of data in the matrix is 211111, that run corresponds to a note, whereas a run such as 01111 (a continuation with no onset) is considered noise and can be removed by denoising.
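A minimal NumPy sketch of this merge-and-denoise step; the 0.5 binarization threshold and the column-wise scan are assumed post-processing choices:

```python
import numpy as np

def merge_and_denoise(onsets, sustains):
    """Merge note-onset and sustained-note matrices and drop noisy runs.

    After binarization, onset positions are marked 2 and pure sustain
    positions 1, matching the description above. Runs of sustains that never
    begin with an onset (e.g. 0 1 1 1 1) are treated as noise and cleared.
    """
    onsets = (onsets > 0.5).astype(np.int32)
    sustains = (sustains > 0.5).astype(np.int32)
    merged = np.where(onsets == 1, 2, sustains)  # 2 = onset, 1 = sustain, 0 = rest
    for pitch in range(merged.shape[1]):
        active = False  # inside a note that started with an onset
        for step in range(merged.shape[0]):
            value = merged[step, pitch]
            if value == 2:
                active = True
            elif value == 1 and not active:
                merged[step, pitch] = 0  # sustain with no onset: noise
            elif value == 0:
                active = False
    return merged
```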
In step 13, a target rendering rule is determined.
The target rendering rule includes a tone combination and a rendering order for performing tone rendering on the MIDI data. For example, where the MIDI data contains three tracks corresponding to a drum, a chord, and a bass, respectively, the target rendering rule may be to render the MIDI data in two passes: only the chord track in the first pass, and the chord, drum, and bass tracks in the second pass.
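The target rendering rule could be represented, for example, as an ordered list of rendering passes plus a track-to-tone mapping; the structure and the preset names below are illustrative assumptions:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class RenderingPass:
    tracks: List[str]          # which tracks (tones) sound in this pass
    repeats: int = 1           # how many times the pass loops

@dataclass
class RenderingRule:
    """Target rendering rule: a tone combination plus a rendering order."""
    tones: dict                                       # track name -> tone preset name (placeholders)
    passes: List[RenderingPass] = field(default_factory=list)

# The two-pass example from the text: chord only, then chord + drum + bass.
example_rule = RenderingRule(
    tones={"chord": "warm_pad", "drum": "club_kit", "bass": "sub_bass"},
    passes=[RenderingPass(["chord"]),
            RenderingPass(["chord", "drum", "bass"])],
)
```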
In one possible embodiment, step 13 may include the steps of:
and determining a rendering rule corresponding to the music preference information input by the user as a target rendering rule according to the corresponding relation between the preset music preference information and the rendering rule.
In another possible embodiment, step 13 may include the steps of:
and inputting the music preference information into a pre-trained rule generation model to obtain an output result of the rule generation model as a target rendering rule.
The rule generation model is obtained by training a neural network model by using second training data, and the second training data comprises a plurality of groups of music preference information samples and rendering rule samples.
Illustratively, the rule generation model may be generated by the following steps (a minimal sketch of this training loop follows the list):
constructing a neural network model;
selecting a group of music preference information samples and rendering rule samples used in the training from the second training data as target music preference information samples and target rendering rule samples;
inputting the target preference information sample into a neural network model used in the training to obtain a fourth output result of the neural network model;
when the condition of stopping training is not met, calculating a model loss value by using the fourth output result and the target rendering rule sample, updating the neural network model used in the training according to the loss value, and performing the next training by using the updated neural network model;
and when the condition of stopping training is met, determining the neural network model used in the training as a rule generation model.
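A minimal PyTorch sketch of the rule-generation training loop outlined in the steps above, assuming the music preference information samples and rendering rule samples have already been vectorized (how they are encoded is not specified here) and using an MSE loss as an illustrative choice:

```python
import torch
import torch.nn as nn

def train_rule_model(model, second_training_data, epochs=100, lr=1e-3):
    """Supervised training of the rule generation model.

    `second_training_data` is assumed to be an iterable of
    (preference_tensor, rendering_rule_tensor) pairs; the vectorization of
    preferences and rendering rules is a modelling assumption.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()       # illustrative choice of loss
    for _ in range(epochs):      # stop-training condition simplified to a fixed epoch count
        for preference, target_rule in second_training_data:
            predicted_rule = model(preference)          # the fourth output result
            loss = loss_fn(predicted_rule, target_rule)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```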
In step 14, tone rendering is performed on the target MIDI data according to the target rendering rule to obtain the target music.
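A minimal sketch of assembling the rendering passes into a MIDI file, assuming the third-party pretty_midi library, the RenderingRule structure sketched earlier, a fixed step duration, and hypothetical General MIDI program numbers; the actual tone rendering to audio would be performed afterwards with a synthesizer or sound font:

```python
import pretty_midi

def render_to_midi(track_notes, rule, out_path="target.mid",
                   step_seconds=0.125, program_map=None):
    """Lay out the rendering passes in time and write a multi-track MIDI file.

    `track_notes` maps a track name to (pitch, start_step, end_step) tuples
    recovered from the merged note matrices; `rule` is a RenderingRule as
    sketched above. The program mapping and fixed step duration are
    illustrative assumptions.
    """
    program_map = program_map or {"chord": 89, "drum": 0, "bass": 38}  # hypothetical GM programs
    pm = pretty_midi.PrettyMIDI()
    instruments = {name: pretty_midi.Instrument(program=program_map.get(name, 0),
                                                is_drum=(name == "drum"), name=name)
                   for name in track_notes}
    offset = 0.0
    for rendering_pass in rule.passes:
        pass_len = 32 * step_seconds  # one generated 32-step segment per pass
        for _ in range(rendering_pass.repeats):
            for name in rendering_pass.tracks:
                for pitch, start, end in track_notes[name]:
                    instruments[name].notes.append(pretty_midi.Note(
                        velocity=100, pitch=pitch,
                        start=offset + start * step_seconds,
                        end=offset + end * step_seconds))
            offset += pass_len
    pm.instruments.extend(instruments.values())
    pm.write(out_path)
```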
According to the technical solution described above, chord information and rhythm type information corresponding to music preference information are determined according to the music preference information input by the user, target MIDI data are then generated according to the chord information and the rhythm type information, and a target rendering rule is determined, where the target rendering rule includes a tone combination and a rendering order for performing tone rendering on MIDI data; the target MIDI data are tone-rendered according to the target rendering rule to obtain the target music. In other words, the chord information and rhythm type information corresponding to the user's music preference are first determined, MIDI data are then generated from that chord and rhythm type information, and the MIDI data are finally rendered according to the determined tone combination and rendering order to produce the final target music. Because the MIDI data are generated from chord and rhythm type information that matches the user's music preference, they better conform to that preference, and rendering them according to the tone combination and rendering order gives the resulting target music richer musical information and a more complete musical structure, so that the demands a DJ (disc-playing) scenario places on music can be met.
Fig. 4 is a block diagram illustrating a music generating apparatus according to an exemplary embodiment. Referring to fig. 4, the apparatus 40 may include:
a first determining module 41 configured to determine chord information and rhythm type information corresponding to music preference information according to the music preference information input by a user;
a generating module 42 configured to generate target MIDI data from the chord information and the rhythm type information;
a second determination module 43 configured to determine a target rendering rule including a tone combination and a rendering order for tone rendering MIDI data;
and a rendering module 44 configured to perform timbre rendering on the target MIDI data according to the target rendering rule to obtain the target music.
Optionally, the generating module 42 includes:
a first processing submodule configured to input the chord information and the rhythm type information into a generative model, and obtain note distribution information of a preset audio track output by the generative model, wherein the generative model is obtained by training a generative adversarial network;
a first determining sub-module configured to determine the target MIDI data according to the note distribution information.
Optionally, the generation model includes at least one generation submodel, and the generation submodel corresponds to the preset audio track one to one;
the first processing submodule includes:
an acquisition submodule configured to acquire noise information;
and the second processing sub-module is configured to input the chord information, the rhythm type information and the noise information into each generation sub-model respectively to obtain note distribution information output by each generation sub-model so as to obtain note distribution information of each preset audio track.
Optionally, the generative model is obtained by:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is configured to acquire first training data, and the first training data comprises chord samples and rhythm samples;
a third determination module configured to determine a training model including a generation network and a discriminator respectively corresponding to the respective preset tracks;
the first processing module is configured to input a target chord sample, a target rhythm sample and obtained target noise to each generation network used in the training respectively to obtain a first output result output by each generation network, wherein the target chord sample and the target rhythm sample are taken from the chord sample and the rhythm sample in the first training data;
the second processing module is configured to input a first output result, the target chord sample and the target rhythm type sample corresponding to each preset audio track to a discriminator corresponding to the preset audio track so as to obtain a second output result output by the discriminator, wherein the second output result is a score of the discriminator on the first output result, and the score is used for reflecting whether the discriminator judges the first output result to be real data;
the updating module is configured to update each generation network used in the training according to the second output result under the condition that the training stopping condition is not met, so as to obtain an updated training model, and the updated training model is used for the next training;
and the fourth determining module is configured to respectively use the generating networks used in the training as generating submodels under the condition that the training stopping condition is met, so as to obtain the generating model consisting of the generating submodels.
Optionally, the training model further comprises a global discriminant network;
the update module includes:
the second determining submodule is configured to determine a first loss value of the generating network corresponding to each preset audio track according to a second output result corresponding to each preset audio track;
the generation sub-module is configured to generate an input tensor according to a first output result, the target chord sample and the target rhythm type sample corresponding to each preset audio track;
a third processing submodule configured to input the input tensor to the global discriminant network, and obtain a third output result output by the global discriminant network;
a third determining submodule configured to determine a second loss value according to the third output result;
and the updating submodule is configured to update the network parameters of each generation network used in the training according to the first loss value and the second loss value.
Optionally, each of the generating submodels includes a first generator for generating note onset point distribution information and a second generator for generating note duration information.
Optionally, the note distribution information includes note onset distribution information and sustained note position information;
the first determination submodule includes:
a merging submodule configured to merge the note onset position information and the sustained note position information to obtain a merged result;
and the denoising submodule is configured to denoise the combined result to obtain the target MIDI data.
Optionally, the second determining module 43 includes:
a fourth determining sub-module configured to determine, according to a preset correspondence between music preference information and a rendering rule, a rendering rule corresponding to the music preference information input by a user as the target rendering rule.
Optionally, the second determining module 43 includes:
and the fourth processing submodule is configured to input the music preference information into a pre-trained rule generating model, and obtain an output result of the rule generating model as the target rendering rule, wherein the rule generating model is obtained by training a neural network model by using second training data, and the second training data comprises a plurality of groups of music preference information samples and rendering rule samples.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
The present disclosure also provides a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the music generation method provided by the present disclosure.
Fig. 5 is a block diagram illustrating a music generation apparatus 800 according to an example embodiment. For example, the apparatus 800 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 800 may include one or more of the following components: a processing component 802, a memory 804, a power component 806, a multimedia component 808, an audio component 810, an input/output (I/O) interface 812, a sensor component 814, and a communication component 816.
The processing component 802 generally controls overall operation of the device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the music generation method described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 may include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the apparatus 800. Examples of such data include instructions for any application or method operating on device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
Power component 806 provides power for the various components of device 800. The power components 806 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the device 800.
The multimedia component 808 includes a screen that provides an output interface between the device 800 and the user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 800 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the apparatus 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the device 800. For example, the sensor assembly 814 may detect the open/closed state of the device 800 and the relative positioning of components, such as the display and keypad of the device 800; the sensor assembly 814 may also detect a change in the position of the device 800 or of a component of the device 800, the presence or absence of user contact with the device 800, the orientation or acceleration/deceleration of the device 800, and a change in the temperature of the device 800. The sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate communication between the apparatus 800 and other devices in a wired or wireless manner. The device 800 may access a wireless network based on a communication standard, such as WiFi, 2G, or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), digital Signal Processors (DSPs), digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described music generation method.
In an exemplary embodiment, a non-transitory computer-readable storage medium comprising instructions, such as the memory 804 comprising instructions, executable by the processor 820 of the device 800 to perform the music generation method described above is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
In another exemplary embodiment, a computer program product is also provided, which comprises a computer program executable by a programmable apparatus, the computer program having code portions for performing the above-mentioned music generation method when executed by the programmable apparatus.
Fig. 6 is a block diagram illustrating a music generation apparatus 1900 according to an example embodiment. For example, the apparatus 1900 may be provided as a server. Referring to FIG. 6, the device 1900 includes a processing component 1922 further including one or more processors and memory resources, represented by memory 1932, for storing instructions, e.g., applications, executable by the processing component 1922. The application programs stored in memory 1932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 1922 is configured to execute instructions to perform the music generation method described above.
The device 1900 may also include a power component 1926 configured to perform power management of the device 1900, a wired or wireless network interface 1950 configured to connect the device 1900 to a network, and an input/output (I/O) interface 1958. The device 1900 may operate based on an operating system stored in the memory 1932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice in the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements that have been described above and shown in the drawings, and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (12)

1. A music generation method, the method comprising:
determining chord information and rhythm type information corresponding to the music preference information according to the music preference information input by a user;
generating target MIDI data according to the chord information and the rhythm type information;
determining a target rendering rule, wherein the target rendering rule comprises a tone combination and a rendering sequence for tone rendering MIDI data;
and performing tone rendering on the target MIDI data according to the target rendering rule to obtain target music.
2. The method of claim 1, wherein generating target MIDI data from the chord information and the rhythm type information comprises:
inputting the chord information and the rhythm type information into a generating model to obtain note distribution information of a preset audio track output by the generating model, wherein the generating model is obtained by training a generative adversarial network;
and determining the target MIDI data according to the note distribution information.
3. The method according to claim 2, wherein the generative model comprises at least one generative submodel, the generative submodel corresponding one-to-one to the preset audio tracks;
the inputting the chord information and the rhythm type information into a generating model to obtain note distribution information of a preset audio track output by the generating model includes:
acquiring noise information;
and respectively inputting the chord information, the rhythm type information and the noise information into each generation submodel to obtain the note distribution information output by each generation submodel so as to obtain the note distribution information of each preset audio track.
4. The method of claim 3, wherein the generative model is obtained by:
acquiring first training data, wherein the first training data comprises a chord sample and a rhythm sample;
determining a training model, wherein the training model comprises a generation network and a discriminator which respectively correspond to each preset audio track;
respectively inputting a target chord sample, a target rhythm sample and the obtained target noise into each generation network used by the training to obtain a first output result output by each generation network, wherein the target chord sample and the target rhythm sample are taken from the chord sample and the rhythm sample in the first training data;
for each preset audio track, inputting a first output result, the target chord sample and the target rhythm type sample corresponding to the preset audio track into a discriminator corresponding to the preset audio track to obtain a second output result output by the discriminator, wherein the second output result is a score of the discriminator on the first output result, and the score is used for reflecting whether the discriminator judges the first output result to be real data;
under the condition that the training stopping condition is not met, updating each generation network used in the training according to the second output result to obtain an updated training model, and using the updated training model for the next training;
and under the condition that the training stopping condition is met, respectively using the generation networks used in the training as generation submodels to obtain a generation model consisting of the generation submodels.
5. The method of claim 4, wherein the training model further comprises a global discriminant network;
the updating each generated network used in the training according to the second output result includes:
determining a first loss value of a generating network corresponding to each preset audio track according to a second output result corresponding to each preset audio track;
generating an input tensor according to a first output result, the target chord sample and the target rhythm type sample corresponding to each preset audio track;
inputting the input tensor to the global discriminant network to obtain a third output result output by the global discriminant network;
determining a second loss value according to the third output result;
and updating the network parameters of each generation network used by the training according to the first loss value and the second loss value.
6. A method according to claim 3, wherein each of the generating submodels includes a first generator for generating note onset point distribution information and a second generator for generating note duration information.
7. The method of claim 2, wherein the note distribution information includes note onset distribution information and sustain note location information;
said determining said target MIDI data according to said note distribution information, comprising:
combining the note starting position information and the continuous note position information to obtain a combined result;
and denoising the merged result to obtain the target MIDI data.
8. The method of claim 1, wherein determining the target rendering rule comprises:
and determining a rendering rule corresponding to the music preference information input by the user as the target rendering rule according to the corresponding relation between the preset music preference information and the rendering rule.
9. The method of claim 1, wherein determining the target rendering rule comprises:
and inputting the music preference information into a pre-trained rule generating model to obtain an output result of the rule generating model as the target rendering rule, wherein the rule generating model is obtained by training a neural network model by using second training data, and the second training data comprises a plurality of groups of music preference information samples and rendering rule samples.
10. An apparatus for generating music, the apparatus comprising:
a first determining module configured to determine chord information and rhythm type information corresponding to music preference information according to the music preference information input by a user;
a generating module configured to generate target MIDI data according to the chord information and the rhythm type information;
a second determination module configured to determine a target rendering rule including a tone combination and a rendering order for tone rendering MIDI data;
and the rendering module is configured to perform tone rendering on the target MIDI data according to the target rendering rule to obtain target music.
11. A music generating apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
determining chord information and rhythm type information corresponding to the music preference information according to the music preference information input by a user;
generating target MIDI data according to the chord information and the rhythm type information;
determining a target rendering rule, wherein the target rendering rule comprises a tone combination and a rendering sequence for performing tone rendering on MIDI data;
and performing tone rendering on the target MIDI data according to the target rendering rule to obtain target music.
12. A computer-readable storage medium, on which computer program instructions are stored, which program instructions, when executed by a processor, carry out the steps of the method according to any one of claims 1 to 9.
CN202111115570.1A 2021-09-23 2021-09-23 Music generation method, device and storage medium Pending CN115881063A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111115570.1A CN115881063A (en) 2021-09-23 2021-09-23 Music generation method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111115570.1A CN115881063A (en) 2021-09-23 2021-09-23 Music generation method, device and storage medium

Publications (1)

Publication Number Publication Date
CN115881063A true CN115881063A (en) 2023-03-31

Family

ID=85762358

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111115570.1A Pending CN115881063A (en) 2021-09-23 2021-09-23 Music generation method, device and storage medium

Country Status (1)

Country Link
CN (1) CN115881063A (en)

Similar Documents

Publication Publication Date Title
CN109801644B (en) Separation method, separation device, electronic equipment and readable medium for mixed sound signal
CN109862393B (en) Method, system, equipment and storage medium for dubbing music of video file
CN107105314A (en) Video broadcasting method and device
CN113382274B (en) Data processing method and device, electronic equipment and storage medium
CN113099297B (en) Method and device for generating click video, electronic equipment and storage medium
CN113409764B (en) Speech synthesis method and device for speech synthesis
CN112426724A (en) Game user matching method and device, electronic equipment and storage medium
CN110660375B (en) Method, device and equipment for generating music
CN115273831A (en) Voice conversion model training method, voice conversion method and device
CN111831249A (en) Audio playing method and device, storage medium and electronic equipment
CN112691385B (en) Method and device for acquiring outgoing and installed information, electronic equipment, server and storage medium
CN113177419A (en) Text rewriting method, device, storage medium and electronic equipment
CN113113040A (en) Audio processing method and device, terminal and storage medium
CN115881063A (en) Music generation method, device and storage medium
WO2021171900A1 (en) Estimation device, estimation method, and estimation system
CN114356068B (en) Data processing method and device and electronic equipment
CN113923517A (en) Background music generation method and device and electronic equipment
CN113691838A (en) Audio bullet screen processing method and device, electronic equipment and storage medium
CN114550691A (en) Multi-tone word disambiguation method and device, electronic equipment and readable storage medium
CN110120211B (en) Melody structure-based melody generation method and device
CN113420553A (en) Text generation method and device, storage medium and electronic equipment
CN109524025B (en) Singing scoring method and device, electronic equipment and storage medium
CN112698757A (en) Interface interaction method and device, terminal equipment and storage medium
CN113707122B (en) Method and device for constructing voice synthesis model
CN110929055A (en) Multimedia quality detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination