WO2022111242A1 - 旋律生成方法、装置、可读介质及电子设备 - Google Patents

旋律生成方法、装置、可读介质及电子设备 Download PDF

Info

Publication number
WO2022111242A1
Authority
WO
WIPO (PCT)
Prior art keywords
melody
information
target
training
encoder
Prior art date
Application number
PCT/CN2021/128322
Other languages
English (en)
French (fr)
Inventor
顾宇
Original Assignee
北京有竹居网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京有竹居网络技术有限公司 filed Critical 北京有竹居网络技术有限公司
Publication of WO2022111242A1 publication Critical patent/WO2022111242A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H1/00Details of electrophonic musical instruments
    • G10H1/0008Associated control or indicating means
    • G10H1/0025Automatic or semi-automatic music composition, e.g. producing random music, applying rules from music theory or modifying a musical piece
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2210/00Aspects or methods of musical processing having intrinsic musical character, i.e. involving musical theory or musical parameters or relying on musical knowledge, as applied in electrophonic musical tools or instruments
    • G10H2210/101Music Composition or musical creation; Tools or processes therefor
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2240/00Data organisation or data communication aspects, specifically adapted for electrophonic musical tools or instruments
    • G10H2240/011Files or data streams containing coded musical information, e.g. for transmission
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/005Algorithms for electrophonic musical instruments or musical processing, e.g. for automatic composition or resource allocation
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10HELECTROPHONIC MUSICAL INSTRUMENTS; INSTRUMENTS IN WHICH THE TONES ARE GENERATED BY ELECTROMECHANICAL MEANS OR ELECTRONIC GENERATORS, OR IN WHICH THE TONES ARE SYNTHESISED FROM A DATA STORE
    • G10H2250/00Aspects of algorithms or signal processing methods without intrinsic musical character, yet specifically adapted for or used in electrophonic musical processing
    • G10H2250/311Neural networks for electrophonic musical instruments or musical processing, e.g. for musical recognition or control, automatic composition or improvisation

Definitions

  • The present disclosure relates to the field of speech synthesis, and in particular to a melody generation method and apparatus, a readable medium and an electronic device.
  • Lyric-based melody generation has long been a hot topic in computer music. Lyrics contain no musical information, and a neural network has no common sense of its own, so it cannot combine emotions and musical knowledge on its own to create a suitable melody; converting such unrelated semantic information with a neural network is therefore a hard problem. Moreover, because lyrics themselves are not musical and are not part of the music theory system, it is easy to produce melodies that do not conform to subjective musical aesthetics.
  • the present disclosure provides a method for generating a melody, the method comprising:
  • the target lyric information is input into a melody generation model, and melody information output by the melody generation model is obtained, wherein the melody generation model is obtained by training an autoencoder conditioned on the target lyric information;
  • a target melody corresponding to the target lyric information is synthesized.
  • the present disclosure provides an apparatus for generating a melody, the apparatus comprising:
  • a first acquisition module, configured to acquire target lyric information for generating a melody;
  • a melody information generation module, configured to input the target lyric information into a melody generation model to obtain melody information output by the melody generation model, wherein the melody generation model is obtained by training an autoencoder conditioned on the target lyric information;
  • a melody synthesis module configured to synthesize a target melody corresponding to the target lyric information according to the melody information.
  • the present disclosure provides a computer-readable medium on which a computer program is stored, and when the program is executed by a processing apparatus, implements the steps of the method described in the first aspect of the present disclosure.
  • the present disclosure provides an electronic device, comprising:
  • a storage device on which a computer program is stored;
  • a processing device, configured to execute the computer program in the storage device to implement the steps of the method in the first aspect of the present disclosure.
  • the target lyric information for generating the melody is obtained, the target lyric information is input into the melody generation model to obtain the melody information output by the melody generation model, and then, according to the melody information, the target melody corresponding to the target lyric information is synthesized.
  • the melody generation model is obtained by training the autoencoder conditioned on the target lyric information.
  • FIG. 1 is a flowchart of a melody generation method provided according to an embodiment of the present disclosure
  • Fig. 2 is according to the melody generation method provided by the present disclosure, an exemplary schematic diagram of the musical meaning of the duration quantization value
  • FIG. 3 is an exemplary schematic diagram of the musical meaning of pause quantization values in the melody generation method provided according to the present disclosure
  • FIG. 4 is an exemplary schematic diagram of the structure of a conditional variational auto-encoder in the melody generation method provided according to the present disclosure
  • FIG. 5 is a block diagram of a melody generating apparatus provided according to an embodiment of the present disclosure.
  • FIG. 6 shows a schematic structural diagram of an electronic device suitable for implementing an embodiment of the present disclosure.
  • the term "including" and variations thereof are open-ended inclusions, i.e., "including but not limited to".
  • the term “based on” is “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • FIG. 1 is a flowchart of a method for generating a melody according to an embodiment of the present disclosure. As shown in Figure 1, the method may include the following steps:
  • in step 11, target lyric information for generating a melody is obtained;
  • in step 12, the target lyric information is input into a melody generation model, and melody information output by the melody generation model is obtained;
  • in step 13, a target melody corresponding to the target lyric information is synthesized according to the melody information.
  • The melody generation model is obtained by training an autoencoder conditioned on the target lyric information.
  • An object of the present disclosure is to generate a melody conforming to the target lyric information based on the target lyric information.
  • the lyric information may include a word sequence and a syllable sequence corresponding to the word sequence.
  • the target lyric information may include a target word sequence and a target syllable sequence corresponding to the target word sequence.
  • the target lyrics can be segmented first, and a sequence of target words can be generated according to the appearance order of each segment in the target lyrics.
  • For example, after the target word sequence is obtained, a preset word vector dictionary can be used to determine the syllables of each word in the target word sequence, and the syllables corresponding to each word are arranged according to the order of the words in the target word sequence, to generate the target syllable sequence corresponding to the target word sequence.
  • In the word vector dictionary, each word and one syllable are encoded as a word vector, and each syllable is aligned with its corresponding word.
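  • For illustration, a minimal sketch of this preprocessing is shown below, assuming a simple whitespace split in place of real word segmentation and a hypothetical syllable_dict lookup standing in for the preset word vector dictionary:

```python
from typing import Dict, List, Tuple

def build_lyric_sequences(
    target_lyrics: str,
    syllable_dict: Dict[str, List[str]],   # hypothetical per-word syllable lookup
) -> Tuple[List[str], List[str]]:
    """Return (target word sequence, target syllable sequence) in lyric order."""
    word_seq = target_lyrics.split()        # stand-in for real word segmentation
    syllable_seq: List[str] = []
    for word in word_seq:
        # Keep syllables in the same order as their words so each syllable
        # stays aligned with the word it belongs to.
        syllable_seq.extend(syllable_dict.get(word, [word]))
    return word_seq, syllable_seq

words, syllables = build_lyric_sequences(
    "twinkle twinkle little star",
    {"twinkle": ["twin", "kle"], "little": ["lit", "tle"], "star": ["star"]},
)
```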
  • the target word sequence and the target syllable sequence can be input into the melody generation model to obtain the melody information output by the melody generation model.
  • the melody information can represent what the melody is like, so that a piece of melody can be easily synthesized based on the content contained in the melody information.
  • the melody information may include at least one of octave, pitch, duration and pause.
  • In a possible implementation, the melody generation model can be obtained in the following manner:
  • multiple sets of training data are acquired;
  • the autoencoder is trained according to the training data to obtain a trained autoencoder;
  • the melody generation model is obtained according to the decoder layer in the trained autoencoder.
  • the automatic encoder includes an encoder layer and a decoder layer.
  • Each set of training data includes historical lyrics information and historical melody information corresponding to the same historical song.
  • the historical lyric information may accordingly include a historical word sequence and a historical syllable sequence.
  • the historical word sequence and the historical syllable sequence can be generated based on the lyrics of the historical song, and the generation method can refer to the method for generating the target word sequence and the target syllable sequence based on the target lyrics given above.
  • the historical melody information is melody information obtained based on the melody of historical songs, which may include at least one of octave, pitch, duration and pause.
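  • For illustration, one training example could be organized as in the sketch below; the field types (token ids for lyrics, quantized indices for melody attributes) are assumptions rather than requirements of the disclosure:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TrainingExample:
    word_seq: List[int]        # historical word sequence (token ids)
    syllable_seq: List[int]    # historical syllable sequence, aligned to words
    octave: List[int]          # per-note octave index (0-9)
    pitch: List[int]           # per-note scale degree (0-6)
    duration: List[int]        # index into the duration quantization table
    pause: List[int]           # index into the pause quantization table
```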
  • the pitch (MIDI pitch), duration, pauses of each song can be stored in separate files.
  • The MIDI pitch range has 128 semitones, so different songs are distributed over different keys. To make learning easier for the neural network, the pitches of all songs can be transposed into C major, the 128 possible pitches can be compressed into 70 tones, and the 70 tones can then be divided into 10 octaves and 7 scale degrees according to the octave relationship.
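  • The exact compression mapping is not fixed by the text above; the sketch below shows one plausible reading, in which only the 7 diatonic pitch classes of C major are kept over 10 octaves (7 × 10 = 70) and accidentals are snapped to the nearest scale degree:

```python
C_MAJOR_DEGREES = [0, 2, 4, 5, 7, 9, 11]  # C D E F G A B as semitone offsets from C

def midi_to_octave_degree(midi_pitch: int) -> tuple[int, int]:
    octave = min(midi_pitch // 12, 9)     # clamp to 10 octaves (assumption)
    pitch_class = midi_pitch % 12
    # snap any accidental to the nearest diatonic degree of C major
    degree = min(range(7), key=lambda i: abs(C_MAJOR_DEGREES[i] - pitch_class))
    return octave, degree

print(midi_to_octave_degree(61))  # MIDI 61 (C#) -> (5, 0): snapped to scale degree C
```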
  • The duration may be quantized to a musically meaningful relative duration; for example, the duration may be set to 12 quantization values [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 8.0, 16.0, 32.0], and the musical meaning of each quantization value can be as shown in FIG. 2.
  • The pause can be set to 8 quantization values [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0], and the musical meaning of each quantization value can be as shown in FIG. 3.
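  • A minimal sketch of this quantization, assuming each raw duration or rest (in beats) is snapped to the nearest listed value:

```python
DURATION_BINS = [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0, 8.0, 16.0, 32.0]
PAUSE_BINS = [0.0, 0.5, 1.0, 2.0, 4.0, 8.0, 16.0, 32.0]

def quantize(value: float, bins: list[float]) -> int:
    """Return the index of the closest quantization value."""
    return min(range(len(bins)), key=lambda i: abs(bins[i] - value))

assert DURATION_BINS[quantize(1.2, DURATION_BINS)] == 1.0
assert PAUSE_BINS[quantize(0.3, PAUSE_BINS)] == 0.5
```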
  • In a possible implementation, the autoencoder may be a conditional variational autoencoder (CVAE, Conditional Variational AutoEncoder).
  • In this implementation, training the autoencoder according to the training data to obtain a trained autoencoder may include the following steps:
  • a set of training data is input into the encoder layer of the conditional variational autoencoder used in this training, and the mean and variance corresponding to the latent space, determined by the encoder layer for the input training data, are obtained;
  • a resampling result corresponding to the latent space is generated according to the mean, the variance and additional noise;
  • the resampling result and condition information are input into the decoder layer of the conditional variational autoencoder used in this training, and the output result of the decoder layer is obtained;
  • if the condition for stopping model training is not met, the conditional variational autoencoder is updated using the output result and the historical melody information used in this training, and the updated conditional variational autoencoder is used in the next training, until the condition for stopping model training is met and the trained autoencoder is obtained.
  • The hidden space, also known as the latent space, is the space constructed by the conditional variational autoencoder.
  • In one training pass, a set of training data is input into the encoder layer of the conditional variational autoencoder used in this training. The encoder layer is provided with hidden layers for calculating the mean and variance, so the mean and variance determined by the encoder layer for the input training data can be obtained, and a resampling result corresponding to the latent space is generated based on the obtained mean and variance and additional noise obtained from the latent space (generally, random noise can be used).
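  • This resampling step corresponds to the standard reparameterization trick; a minimal PyTorch-style sketch, assuming mu and logvar come from the encoder's hidden layers:

```python
import torch

def resample(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    std = torch.exp(0.5 * logvar)   # variance predicted as log-variance (assumption)
    eps = torch.randn_like(std)     # random noise drawn from the latent space
    return mu + eps * std           # resampling result fed to the decoder layer
```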
  • For example, a plurality of sub-networks can be set in the encoder layer, respectively used to input lyric-related content and melody-related content. For example, if the training data includes historical lyric information and historical melody information, where the historical lyric information includes a historical word sequence and a historical syllable sequence and the historical melody information includes octave, pitch, duration and pause, then 6 sub-networks can be set in the encoder layer, respectively used to input the historical word sequence, the historical syllable sequence, octave, pitch, duration and pause.
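  • A rough PyTorch sketch of such an encoder layer is shown below; the embedding sub-networks, the mean-pooling over time and the layer sizes are illustrative assumptions, not taken from the disclosure:

```python
import torch
import torch.nn as nn

class CVAEEncoder(nn.Module):
    """Encoder layer with one sub-network per input stream (words, syllables,
    octave, pitch, duration, pause); sizes are illustrative."""

    def __init__(self, vocab_sizes: dict, hidden: int = 128, latent: int = 64):
        super().__init__()
        # one embedding sub-network per input stream
        self.subnets = nn.ModuleDict({
            name: nn.Embedding(size, hidden) for name, size in vocab_sizes.items()
        })
        self.mu = nn.Linear(hidden * len(vocab_sizes), latent)
        self.logvar = nn.Linear(hidden * len(vocab_sizes), latent)

    def forward(self, inputs: dict) -> tuple:
        # each input is a (batch, seq) tensor of ids; pool each stream over time
        feats = [self.subnets[name](x).mean(dim=1) for name, x in inputs.items()]
        h = torch.cat(feats, dim=-1)
        return self.mu(h), self.logvar(h)
```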
  • Afterwards, the resampling result and the condition information can be input into the decoder layer of the conditional variational autoencoder used in this training, to obtain the output result of the decoder layer.
  • Multiple sub-networks can be set in the decoder layer, each used to generate one item of the melody information. For example, if the melody information includes octave, pitch, duration and pause, the decoder layer includes 4 sub-networks, respectively used to generate octave, pitch, duration and pause.
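  • A matching decoder-layer sketch with four output sub-networks; again, the dimensions and the concatenation of the latent sample with the lyric condition are illustrative assumptions:

```python
import torch
import torch.nn as nn

class CVAEDecoder(nn.Module):
    """Decoder layer with 4 output sub-networks (octave, pitch, duration, pause),
    conditioned on lyric features; dimensions are illustrative."""

    def __init__(self, latent: int = 64, cond: int = 128, hidden: int = 128):
        super().__init__()
        self.shared = nn.Sequential(nn.Linear(latent + cond, hidden), nn.ReLU())
        self.heads = nn.ModuleDict({
            "octave": nn.Linear(hidden, 10),    # 10 octaves
            "pitch": nn.Linear(hidden, 7),      # 7 scale degrees
            "duration": nn.Linear(hidden, 12),  # 12 duration quantization values
            "pause": nn.Linear(hidden, 8),      # 8 pause quantization values
        })

    def forward(self, z: torch.Tensor, condition: torch.Tensor) -> dict:
        h = self.shared(torch.cat([z, condition], dim=-1))
        return {name: head(h) for name, head in self.heads.items()}
```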
  • The condition information may be the historical lyric information used in this training. Using lyric-related content as the condition information is beneficial for generating a melody corresponding to the semantics of the lyrics.
  • Based on the output result of the decoder layer, the loss value of this training can be calculated against the historical melody information used in this training.
  • If the condition for stopping model training is not met (the condition may be, for example, that the loss value is greater than a preset threshold), the conditional variational autoencoder is updated using the output result and the historical melody information used in this training, and the updated conditional variational autoencoder is used in the next training, until the condition for stopping model training is met.
  • the related manner of model updating belongs to the common knowledge in the art, and will not be repeated here.
  • In the training process, the purpose of training the encoder layer of the conditional variational autoencoder is to make the mean and variance output by the encoder layer as close as possible to a mean of 0 and a variance of 1.
  • The purpose of training the decoder layer is to make the output result of the decoder layer (i.e., the actually obtained melody information) as close as possible to the historical melody information used for training. Therefore, the encoder layer can be trained first, and after the encoder layer is trained, its parameters are fixed and the decoder layer is further trained.
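  • The two-stage training described above could look roughly like the following sketch, where the encoder is first pushed toward a standard normal latent (KL term) and then frozen while the decoder is trained against the historical melody; the specific losses and optimizer usage are assumptions:

```python
import torch
import torch.nn.functional as F

def kl_to_standard_normal(mu, logvar):
    # distance of the encoder output from mean 0 / variance 1
    return -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())

def train_step(encoder, decoder, batch, condition, targets, opt, stage: str):
    mu, logvar = encoder(batch)
    z = mu + torch.randn_like(mu) * torch.exp(0.5 * logvar)   # resampling
    outputs = decoder(z, condition)                            # dict of logits
    recon = sum(F.cross_entropy(outputs[k], targets[k]) for k in outputs)
    loss = kl_to_standard_normal(mu, logvar) if stage == "encoder" else recon
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

# After the encoder stage, its parameters would be frozen before the decoder stage:
# for p in encoder.parameters(): p.requires_grad = False
```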
  • both the encoder layer and the decoder layer include the transformer layer.
  • the transformer layer can better sense the integrity of the sequence.
  • the structure of the conditional variational autoencoder can be as shown in FIG. 4 .
  • In FIG. 4, at the decoder layer, a transformer layer is additionally added to each of the sub-networks that output octave and pitch, because octave and pitch carry higher semantic information; adding the transformer layer helps achieve a better training effect.
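  • One way such a transformer layer might sit in front of the pitch sub-network, sketched with PyTorch; layer sizes are assumptions:

```python
import torch.nn as nn

pitch_head = nn.Sequential(
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
        num_layers=1,
    ),
    nn.Linear(128, 7),   # 7 scale degrees
)
```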
  • In the above manner, the melody generation model is obtained using a conditional variational autoencoder. The latent-space mechanism of the conditional variational autoencoder can be used to encode the learned melodies in the latent space and recombine them into new melodies, so that melody generation better suits the learning mechanism of neural networks, which is conducive to generating melodies with suitable pitch differences and reduces the generation of dissonant melodies.
  • Optionally, training the autoencoder according to the training data to obtain a trained autoencoder may include the following step:
  • the decoder layer in the autoencoder is trained with a generative adversarial network and a discriminator to obtain a trained autoencoder.
  • In the process of training the autoencoder, a Generative Adversarial Network (GAN) and a discriminator can further be combined to train the autoencoder, so as to improve the diversity of the results generated by the decoder layer.
  • the decoder layer in the autoencoder can be trained using generative adversarial networks and discriminators.
  • The output of the discriminator is a floating-point number between 0 and 1; the closer the output is to 0, the more real the discriminator considers the input, and the closer it is to 1, the more fake.
  • For example, one dataset can be used for two passes, with two loss computations. In the first pass, only the inputs and outputs of the conditional variational autoencoder are used, and the calculated loss is used to update the network, i.e., the training process given above.
  • In the second pass, the corresponding historical lyric information and historical melody information are used as the real samples of the discriminator, and this data is labeled 0.
  • Samples are drawn in the latent space and input into the discriminator as fake samples, the discriminator loss computed against the label 0 is regarded as the loss of the decoder layer, and the weights of the internal parameters of the decoder layer are updated by derivation, to obtain the trained autoencoder.
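  • A rough sketch of this adversarial pass; the feature construction fed to the discriminator and the exact loss are assumptions, and the intent is only to show the decoder being pushed toward the discriminator's "real" label (0) on latent-sampled outputs:

```python
import torch
import torch.nn as nn

COND_DIM, MELODY_DIM = 128, 10 + 7 + 12 + 8      # lyric condition + melody probabilities

discriminator = nn.Sequential(
    nn.Linear(COND_DIM + MELODY_DIM, 64), nn.ReLU(),
    nn.Linear(64, 1), nn.Sigmoid(),              # 0 ~ real sample, 1 ~ fake sample
)

def decoder_adversarial_loss(decoder, condition, latent_dim=64):
    z = torch.randn(condition.size(0), latent_dim)       # random sample from the latent space
    fake = decoder(z, condition)                         # dict of logits per melody attribute
    probs = torch.cat(
        [fake[k].softmax(-1) for k in ("octave", "pitch", "duration", "pause")], dim=-1
    )
    score = discriminator(torch.cat([condition, probs], dim=-1))
    return score.mean()                                  # drive the score toward 0 ("real")
```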
  • the melody generation model is generated by combining the generative adversarial network in the process of training the autoencoder. Since the generative adversarial network will randomly sample in the latent space during the training process, it can effectively reduce the overfitting in the training process. In this way, the obtained melody generation model can generate different melodies with the same lyric information, and the melodies are more diverse.
  • the decoder layer can be extracted as a melody generation model.
  • the target lyric information may be input into the model as a condition of the melody generation model, so that the obtained melody information fits the semantic information of the target lyric.
  • step 12 may include the following steps:
  • the target lyric information and target noise are input into the melody generation model, and the melody information output by the melody generation model is obtained.
  • For example, the target noise may be noise sampled from the latent space, or may be actually collected real noise. The target noise is used as the resampling content and the target lyric information is used as the condition of the melody generation model; they are jointly input into the melody generation model to obtain the melody information output by the melody generation model. This is beneficial for obtaining more realistic melody information.
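  • A minimal inference sketch along these lines, assuming the extracted decoder from the earlier sketches and a greedy argmax readout of the melody information:

```python
import torch

def generate_melody(decoder, lyric_condition: torch.Tensor, latent_dim: int = 64) -> dict:
    target_noise = torch.randn(lyric_condition.size(0), latent_dim)  # sampled from the latent space
    with torch.no_grad():
        logits = decoder(target_noise, lyric_condition)
    # octave / pitch / duration / pause indices for the generated melody
    return {name: out.argmax(dim=-1) for name, out in logits.items()}
```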
  • step 13 a target melody corresponding to the target lyric information is synthesized according to the melody information.
  • Through the above technical solution, the target lyric information for generating the melody is obtained, the target lyric information is input into the melody generation model to obtain the melody information output by the melody generation model, and then, according to the melody information, the target melody corresponding to the target lyric information is synthesized.
  • the melody generation model is obtained by training the autoencoder conditioned on the target lyric information. Therefore, using the target lyric information for generating the melody as the condition of the auto-encoder is beneficial to generate a melody corresponding to the semantics of the target lyric information, and the generated target melody is more harmonious.
  • the method provided by the present disclosure may further include the following steps:
  • the target song is synthesized.
  • the target song may be generated according to the target lyric information and the target melody based on the speech synthesis technology or the song synthesis technology. In this way, not only a melody corresponding to the lyrics can be obtained based on the lyrics, but also a song can be synthesized based on the generated melody, so as to obtain a whole song matching the content of the lyrics.
  • FIG. 5 is a block diagram of a melody generating apparatus provided according to an embodiment of the present disclosure. As shown in Figure 5, the device 50 includes:
  • the first acquisition module 51 is used to acquire target lyrics information for generating a melody
  • the melody information generation module 52 is used for inputting the target lyric information into the melody generation model to obtain the melody information output by the melody generation model, wherein the melody generation model is obtained by training an autoencoder conditioned on the target lyric information;
  • the melody synthesis module 53 is configured to synthesize a target melody corresponding to the target lyric information according to the melody information.
  • the device 50 obtains the melody generation model through the following modules:
  • the second acquisition module is used to acquire multiple groups of training data, wherein each group of training data includes historical lyrics information and historical melody information corresponding to the same historical song;
  • a training module configured to train the auto-encoder according to the training data to obtain a trained auto-encoder, wherein the auto-encoder includes an encoder layer and a decoder layer;
  • a model generation module configured to obtain the melody generation model according to the decoder layer in the auto-encoder that has been trained.
  • the autoencoder is a conditional variational autoencoder
  • the training module is used to:
  • input a set of the training data into the encoder layer of the conditional variational autoencoder used in this training, and obtain the mean and variance corresponding to the latent space determined by the encoder layer for the input training data;
  • generate a resampling result corresponding to the latent space according to the mean, the variance and additional noise;
  • input the resampling result and condition information into the decoder layer of the conditional variational autoencoder used in this training to obtain the output result of the decoder layer, wherein the condition information is the historical lyric information used in this training;
  • if the condition for stopping model training is not met, update the conditional variational autoencoder using the output result and the historical melody information used in this training, and use the updated conditional variational autoencoder in the next training, until the condition for stopping model training is met, to obtain the trained autoencoder.
  • both the encoder layer and the decoder layer include a transformer layer.
  • the training module is configured to use a generative adversarial network and a discriminator to train the decoder layer in the auto-encoder to obtain a trained auto-encoder.
  • the melody information generation module 52 includes:
  • an acquisition sub-module, used to acquire target noise;
  • the melody information generation module 52 is configured to input the target lyric information and the target noise into the melody generation model to obtain melody information output by the melody generation model.
  • the device 50 further includes:
  • a song synthesis module used for synthesizing a target song according to the target lyric information and the target melody.
  • the lyrics information includes a word sequence and a syllable sequence corresponding to the word sequence
  • the melody information includes at least one of octave, pitch, duration, and pause.
  • Terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablets), PMPs (portable multimedia players) and vehicle-mounted terminals (e.g., in-vehicle navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 6 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • An electronic device 600 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 601, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 602 or a program loaded from a storage device 608 into a random access memory (RAM) 603. Various programs and data necessary for the operation of the electronic device 600 are also stored in the RAM 603.
  • the processing device 601, the ROM 602, and the RAM 603 are connected to each other through a bus 604.
  • An input/output (I/O) interface 605 is also connected to bus 604 .
  • The following devices may be connected to the I/O interface 605: input devices 606 including, for example, a touch screen, touchpad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; output devices 607 including, for example, a liquid crystal display (LCD), speakers, a vibrator, etc.; storage devices 608 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 609.
  • Communication means 609 may allow electronic device 600 to communicate wirelessly or by wire with other devices to exchange data. While FIG. 6 shows electronic device 600 having various means, it should be understood that not all of the illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a non-transitory computer readable medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication device 609, or from the storage device 608, or from the ROM 602.
  • When the computer program is executed by the processing apparatus 601, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are executed.
  • the computer-readable medium mentioned above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the above two.
  • the computer-readable storage medium can be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples of computer readable storage media may include, but are not limited to, electrical connections having one or more wires, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable Programmable read only memory (EPROM or flash memory), fiber optics, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave with computer-readable program code embodied thereon. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium can also be any computer-readable medium other than a computer-readable storage medium that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device .
  • Program code embodied on a computer readable medium may be transmitted using any suitable medium including, but not limited to, electrical wire, optical fiber cable, RF (radio frequency), etc., or any suitable combination of the foregoing.
  • In some embodiments, the server can communicate using any currently known or future developed network protocol, such as HTTP (HyperText Transfer Protocol), and can interconnect with digital data communication in any form or medium (e.g., a communication network).
  • Examples of communication networks include local area networks ("LAN"), wide area networks ("WAN"), internetworks (e.g., the Internet) and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; or may exist alone without being assembled into the electronic device.
  • The above-mentioned computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device: obtains target lyric information for generating a melody; inputs the target lyric information into a melody generation model to obtain melody information output by the melody generation model, wherein the melody generation model is obtained by training an autoencoder conditioned on the target lyric information; and synthesizes, according to the melody information, a target melody corresponding to the target lyric information.
  • Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server.
  • The remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (e.g., through the Internet using an Internet service provider).
  • Each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical function(s).
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented in dedicated hardware-based systems that perform the specified functions or operations , or can be implemented in a combination of dedicated hardware and computer instructions.
  • the modules involved in the embodiments of the present disclosure may be implemented in software or hardware.
  • the name of the module does not constitute a limitation of the module itself in some cases, for example, the first acquisition module can also be described as "a module for acquiring target lyrics information for generating a melody”.
  • exemplary types of hardware logic components include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chips (SOCs), Complex Programmable Logical Devices (CPLDs) and more.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with the instruction execution system, apparatus or device.
  • the machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • Machine-readable media may include, but are not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, devices, or devices, or any suitable combination of the foregoing.
  • machine-readable storage media would include one or more wire-based electrical connections, portable computer disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), fiber optics, compact disk read only memory (CD-ROM), optical storage, magnetic storage, or any suitable combination of the foregoing.
  • a method for generating a melody comprising:
  • the target lyric information is input into the melody generation model, and the melody information output by the melody generation model is obtained, wherein the melody generation model is obtained according to the training of an autoencoder conditioned on the target lyric information;
  • a target melody corresponding to the target lyric information is synthesized.
  • a melody generation method is provided, and the melody generation model is obtained in the following manner:
  • multiple sets of training data are acquired, wherein each set of training data includes historical lyric information and historical melody information corresponding to the same historical song;
  • the autoencoder is trained according to the training data to obtain a trained autoencoder, wherein the autoencoder includes an encoder layer and a decoder layer;
  • the melody generation model is obtained according to the decoder layer in the trained auto-encoder.
  • a method for generating a melody wherein the autoencoder is a conditional variational autoencoder
  • the training the autoencoder according to the training data to obtain the trained autoencoder includes:
  • inputting a set of the training data into the encoder layer of the conditional variational autoencoder used in this training, and obtaining the mean and variance corresponding to the latent space determined by the encoder layer for the input training data;
  • generating a resampling result corresponding to the latent space according to the mean, the variance and additional noise;
  • inputting the resampling result and condition information into the decoder layer of the conditional variational autoencoder used in this training to obtain the output result of the decoder layer, wherein the condition information is the historical lyric information used in this training;
  • if the condition for stopping model training is not met, updating the conditional variational autoencoder using the output result and the historical melody information used in this training, and using the updated conditional variational autoencoder in the next training, until the condition for stopping model training is met, to obtain the trained autoencoder.
  • a method for generating a melody wherein both the encoder layer and the decoder layer include a transformer layer.
  • a method for generating a melody includes:
  • the decoder layer in the autoencoder is trained using a generative adversarial network and a discriminator to obtain a trained autoencoder.
  • a melody generation method wherein the inputting the target lyric information into a melody generation model to obtain the melody information output by the melody generation model includes:
  • the target lyric information and the target noise are input into the melody generation model to obtain melody information output by the melody generation model.
  • a method for generating a melody, the method further comprising:
  • synthesizing a target song according to the target lyric information and the target melody.
  • a method for generating a melody wherein the lyrics information includes a word sequence and a syllable sequence corresponding to the word sequence;
  • the melody information includes at least one of octave, pitch, duration, and pause.
  • a melody generating apparatus comprising:
  • a first acquisition module, configured to acquire target lyric information for generating a melody;
  • a melody information generation module, configured to input the target lyric information into a melody generation model to obtain melody information output by the melody generation model, wherein the melody generation model is obtained by training an autoencoder conditioned on the target lyric information;
  • a melody synthesis module configured to synthesize a target melody corresponding to the target lyric information according to the melody information.
  • a computer-readable medium on which a computer program is stored, and when the program is executed by a processing apparatus, implements the steps of the melody generation method described in any embodiment of the present disclosure.
  • an electronic device, comprising:
  • a storage device on which a computer program is stored;
  • a processing device, configured to execute the computer program in the storage device to implement the steps of the melody generation method described in any embodiment of the present disclosure.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Auxiliary Devices For Music (AREA)

Abstract

A melody generation method and apparatus, a readable medium, an electronic device and a computer program product. The method comprises: acquiring target lyric information for generating a melody (11); inputting the target lyric information into a melody generation model to obtain melody information output by the melody generation model (12), wherein the melody generation model is obtained by training an autoencoder conditioned on the target lyric information; and synthesizing, according to the melody information, a target melody corresponding to the target lyric information (13).

Description

旋律生成方法、装置、可读介质及电子设备
相关申请的交叉引用
本申请是以申请号为202011349370.8,申请日为2020年11月26日的中国申请为基础,并主张其优先权,该中国申请的公开内容在此作为整体引入本申请中。
技术领域
本公开涉及语音合成领域,具体地,涉及一种旋律生成方法、装置、可读介质及电子设备。
背景技术
基于歌词的旋律生成一直是计算机音乐领域的热门话题。由于歌词中不含音乐性的信息,且神经网络本身没有常识,无法将各种情感和音乐知识结合起来,创作出合适的旋律,因此,利用神经网络转换这些不相关的语义信息是一个巨大难题。并且,由于歌词本身不是音乐性的,也不是音乐理论系统的一部分,所以易产生不符合主观音乐审美的旋律。
发明内容
提供该发明内容部分以便以简要的形式介绍构思,这些构思将在后面的具体实施方式部分被详细描述。该发明内容部分并不旨在标识要求保护的技术方案的关键特征或必要特征,也不旨在用于限制所要求的保护的技术方案的范围。
第一方面,本公开提供一种旋律生成方法,所述方法包括:
获取用于生成旋律的目标歌词信息;
将所述目标歌词信息输入至旋律生成模型中,得到所述旋律生成模型输出的旋律信息,其中,所述旋律生成模型是根据以目标歌词信息为条件的自动编码器训练得到的;
根据所述旋律信息,合成与所述目标歌词信息对应的目标旋律。
第二方面,本公开提供一种旋律生成装置,所述装置包括:
第一获取模块,用于获取用于生成旋律的目标歌词信息;
旋律信息生成模块,用于将所述目标歌词信息输入至旋律生成模型中,得到所述旋律生成模型输出的旋律信息,其中,所述旋律生成模型是根据以目标歌词信息为条件的自动编码器训练得到的;
旋律合成模块,用于根据所述旋律信息,合成与所述目标歌词信息对应的目标旋律。
第三方面,本公开提供一种计算机可读介质,其上存储有计算机程序,该程序被处理装置执行时实现本公开第一方面所述方法的步骤。
第四方面,本公开提供一种电子设备,包括:
存储装置,其上存储有计算机程序;
处理装置,用于执行所述存储装置中的所述计算机程序,以实现本公开第一方面所述方法的步骤。
通过上述技术方案,获取用于生成旋律的目标歌词信息,并将目标歌词信息输入至旋律生成模型中,得到旋律生成模型输出的旋律信息,再根据旋律信息,合成与目标歌词信息对应的目标旋律。其中,旋律生成模型是根据以目标歌词信息为条件的自动编码器训练得到的。
本公开的其他特征和优点将在随后的具体实施方式部分予以详细说明。
附图说明
结合附图并参考以下具体实施方式,本公开各实施例的上述和其他特征、优点及方面将变得更加明显。贯穿附图中,相同或相似的附图标记表示相同或相似的元素。应当理解附图是示意性的,原件和元素不一定按照比例绘制。在附图中:
图1是根据本公开的一种实施方式提供的旋律生成方法的流程图;
图2是根据本公开提供的旋律生成方法中,时长量化值音乐含义的一种示例性的示意图;
图3是根据本公开提供的旋律生成方法中,停顿量化值音乐含义的一种示例性的示意图;
图4是根据本公开提供的旋律生成方法中,条件变分自动编码器的结构的一种示例性的示意图;
图5是根据本公开的一种实施方式提供的旋律生成装置的框图;
图6示出了适于用来实现本公开实施例的电子设备的结构示意图。
具体实施方式
下面将参照附图更详细地描述本公开的实施例。虽然附图中显示了本公开的某些实施例,然而应当理解的是,本公开可以通过各种形式来实现,而且不应该被解释为限于这里阐述的实施例,相反提供这些实施例是为了更加透彻和完整地理解本公开。应当理解的是,本公开的附图及实施例仅用于示例性作用,并非用于限制本公开的保护范围。
应当理解,本公开的方法实施方式中记载的各个步骤可以按照不同的顺序执行,和/或并行执行。此外,方法实施方式可以包括附加的步骤和/或省略执行示出的步骤。本公开的范围在此方面不受限制。
本文使用的术语“包括”及其变形是开放性包括,即“包括但不限于”。术语“基于”是“至少部分地基于”。术语“一个实施例”表示“至少一个实施例”;术语“另一实施例”表示“至少一个另外的实施例”;术语“一些实施例”表示“至少一些实施例”。其他术语的相关定义将在下文描述中给出。
需要注意,本公开中提及的“第一”、“第二”等概念仅用于对不同的装置、模块或单元进行区分,并非用于限定这些装置、模块或单元所执行的功能的顺序或者相互依存关系。
需要注意,本公开中提及的“一个”、“多个”的修饰是示意性而非限制性的,本领域技术人员应当理解,除非在上下文另有明确指出,否则应该理解为“一个或多个”。
本公开实施方式中的多个装置之间所交互的消息或者信息的名称仅用于说明性的目的,而并不是用于对这些消息或信息的范围进行限制。
图1是根据本公开的一种实施方式提供的旋律生成方法的流程图。如图1所示,该方法可以包括以下步骤:
在步骤11中,获取用于生成旋律的目标歌词信息;
在步骤12中,将目标歌词信息输入至旋律生成模型中,得到旋律生成模型输出的旋律信息;
在步骤13中,根据旋律信息,合成与目标歌词对应的目标旋律。
其中,旋律生成模型是根据以目标歌词为条件的自动编码器训练得到的。
本公开的目的在于基于目标歌词信息,生成符合该目标歌词信息的旋律。
在本公开中,歌词信息可以包括单词序列以及与单词序列对应的音节序列。相应地,目标歌词信息可以包括目标单词序列以及与目标单词序列对应的目标音节序列。
示例地,对于用来生成旋律的一段歌词(后文将称为目标歌词),可以首先对目标歌词进行分词处理,并按照各分词在目标歌词中的出现顺序生成目标单词序列。再例如,在获得目标单词序列后,可以使用预设的单词向量字典确定目标单词序列中各单词的音节,并按照各单词在目标单词序列中的顺序排列各单词对应的音节,以生成与目标单词序列对应的目标音节序列。其中,在单词向量字典中,每个单词和一个音节被编码为一单词向量,并将每个音节与其对应的单词对齐。
在获得目标单词序列和目标音节序列后,可以将目标单词序列和目标音节序列输入至旋律生成模型中,得到旋律生成模型输出的旋律信息。旋律信息能够表征旋律是什么样的, 从而基于旋律信息中包含的内容容易合成出一段旋律。其中,旋律信息可以包括八度、音高、音长和停顿中的至少一种。
在一种可能的实施方式中,旋律生成模型可以通过如下方式获得:
获取多组训练数据;
根据训练数据,对自动编码器进行训练,得到训练完成的自动编码器;
根据训练完成的自动编码器中的解码器层,得到旋律生成模型。
其中,自动编码器包括编码器层和解码器层。
每一组训练数据包括同一历史歌曲对应的历史歌词信息和历史旋律信息。如上文所述的对于歌词信息的释义,历史歌词信息可以相应包括历史单词序列和历史音节序列。历史单词序列和历史音节序列可以基于历史歌曲的歌词生成,生成方式可以参考前文给出的基于目标歌词生成目标单词序列和目标音节序列的方式。
历史旋律信息是基于历史歌曲的旋律得到的旋律信息,其中可以包含八度、音高、音长和停顿中的至少一种。举例来说,每一首歌曲的音高(MIDI音高)、时长、停顿可以分别存储在单独的文件中。其中,MIDI音高有128个半音,所以不同的歌曲会以不同的音调分布,为了方便神经网络的学习,可以将所有歌曲的音调都转移到C大调,C大调中将128个可选音压缩为70个音调,然后根据八度关系将70个音调分为10个八度音阶和7个音阶。示例地,可以将时长量化为具有音乐意义的相对持续时长,例如,将时长设置12个量化值[0.25、0.5、0.75、1.0、1.5、2.0、3.0、4.0、6.0、8.0、16.0、32.0],每种量化值的音乐含义可以如图2所示。示例地,可以将停顿设置8个量化值[0.0、0.5、1.0、2.0、4.0、8.0、16.0、32.0],每种量化值的音乐含义可以如图3所示。
在一种可能的实施方式中,自动编码器可以为条件变分自动编码器(CVAE,Conditional AutoEncoder),在这一实施方式中,根据训练数据,对自动编码器进行训练,得到训练完成的自动编码器,可以包括以下步骤:
将一组训练数据输入至本次训练所使用的条件变分自动编码器中的编码器层,获得编码器层针对输入的训练数据所确定的对应于隐空间的均值和方差;
根据均值、方差和附加噪声,生成对应于隐空间的重采样结果;
将重采样结果和条件信息输入至本次训练所使用的条件变分自动编码器中的解码器层,获得解码器层的输出结果;
在不满足模型停止训练条件的情况下,利用输出结果和本次训练所使用的历史旋律信息更新条件变分自动编码器,并将更新后的条件变分自动编码器用于下一次的训练中,直至满足模型停止训练条件,得到训练完成的自动编码器。
隐空间也可称作潜在空间,它是条件变分自动编码器所构建出的空间。
在一次训练过程中,将一组训练数据输入至本次训练所使用的条件变分自动编码器中的编码器层,编码器层中设置有用于计算均值和方差的隐层,因此,能够获得编码器层针对输入的训练数据所确定的均值和方差,基于得到的均值、方差以及从隐空间获得的附件噪声(一般可以使用随机噪声),生成对应于隐空间的重采样结果。示例地,编码器层中可以设置多个子网络,分别用于输入歌词相关内容和旋律相关内容,举例来说,若训练数据中包含历史歌词信息和历史旋律信息,其中,历史歌词信息包括历史单词序列和历史音节序列,历史旋律信息包括八度、音高、音长和停顿这四者,则编码器层中可以设置6个子网络,分别用于输入历史单词序列、历史音节序列、八度、音高、音长和停顿。
之后,可以使用条件变分自动编码器中的解码器层,将重采样结果和条件信息输入至本次训练所使用的条件变分自动编码器中的解码器层,获得解码器层的输出结果。解码器层中可以设置有多个子网络,分别用于生成旋律信息中的一者,例如,若旋律信息包括八度、音高、音长和停顿,则解码器层中包含4个子网络,并分别用于生成八度、音高、音长和停顿。其中,条件信息可以使用本次训练所使用的历史歌词信息。由此,将歌词相关内容作为条件信息,有利于生成与歌词语义相对应的旋律。
基于解码器的输出结果,可以与本次训练所使用的历史旋律信息计算本次训练的损失值。在不满足模型停止训练条件(模型停止训练条件可以例如为,损失值大于预设阈值)的情况下,利用输出结果和本次训练所使用的历史旋律信息更新条件变分自动编码器,并将更新后的条件变分自动编码器用于下一次的训练中,直至满足模型停止训练条件。其中,模型更新的相关方式属于本领域的公知常识,此处不赘述。
在训练过程中,训练条件变分自动编码器的编码器层的目的在于使编码器层输出的均值、方差尽可能地接近均值为0且方差为1的状态,训练条件变分自动编码器的解码器层的目的则在于使解码器层的输出结果(即实际得到的旋律信息)尽可能地接近训练所使用的历史旋律信息。因此,可以首先训练编码器层,在编码器层训练完毕后,固定编码器层的参数,进一步训练解码器层。
其中,编码器层和解码器层中均包含transformer层。这样,transformer层可以更好地感测序列的完整性。示例地,条件变分自动编码器的结构可以如图4所示。在图4中,在解码器层,在输出八度和音高的子网络中又各自增加了transformer层,这是由于八度、音高具有较高的语义信息,增加transformer层有利于达到更好的训练效果。
通过上述方式,利用条件变分自动编码器得到旋律生成模型,能够利用条件变分自动编码器中潜在空间的机制,在潜在空间上对学习到的旋律进行编码,并重组成新的旋律, 使旋律生成更适合于神经网络的学习机制,有利于生成音高差合适的旋律,减少生成不和谐旋律。
可选地,根据训练数据,对自动编码器进行训练,得到训练完成的自动编码器,可以包括以下步骤:
利用生成式对抗网络和鉴别器对自动编码器中的解码器层进行训练,以得到训练完成的自动编码器。
在训练自动编码器的过程中,还可以进一步结合生成式对抗网络(GAN,Generative Adversarial Networks)和鉴别器(Discriminator)对自动编码器进行训练,以提升解码器层生成结果的多样性。其中,可以利用生成式对抗网络和鉴别器对自动编码器中的解码器层进行训练。
鉴别器的输出为0到1之间的浮点数,鉴别器的输出越接近0,则鉴别器认为输入越真实,而鉴别器的输出越接近1,鉴别器认为输入越虚假。示例地,可以使用一个数据集进行两次处理,进行两次损失(loss)计算,第一次处理中,仅使用条件变分自动编码器输入和输出,并且使用计算出的损耗来更新网络,也就是前文中给出的训练过程,第二次处理中,将对应的历史歌词信息和历史旋律信息用作鉴别器的真实样本,并将数据集标注为0,通过在潜在空间中进行采样,采样结果作为虚假样本输入至鉴别器中,将鉴别器损失值为0视为解码器层的损失,通过推导方式更新解码器层内部参数的权重,以得到训练完成的自动编码器。
通过上述方式,在训练自动编码器的过程中结合生成式对抗网络生成旋律生成模型,由于生成式对抗网络在训练过程中会在潜在空间随机抽样,因此可以有效减少训练过程中的过拟合。这样,获得的旋律生成模型能以相同的歌词信息生成不同的旋律,旋律更多样。
在获得训练完成的自动编码器后,可以将其中的解码器层提取出来,作为旋律生成模型。
在一种可能的实施方式中,步骤12中,可以将目标歌词信息作为旋律生成模型的条件输入至模型中,以使获得的旋律信息贴合目标歌词的语义信息。
在另一种可能的实施方式中,步骤12可以包括以下步骤:
获取目标噪声;
将目标歌词信息和目标噪声输入至旋律生成模型中,得到旋律生成模型输出的旋律信息。
示例地,目标噪声可以为采样自隐空间的噪声,也可以为实际采集的实际噪声。从而,将目标噪声作为重采样内容,并将目标歌词信息作为旋律生成模型的条件,共同输入至旋 律生成模型中,以得到旋律生成模型输出的旋律信息。这样,有利于获得更为真实的旋律信息。
在步骤13中,根据旋律信息,合成与目标歌词信息对应的目标旋律。
如上所述,根据旋律信息中的八度、音高、音长和停顿中的至少一种,容易合成出一段旋律,因此,基于旋律信息,即可获得一段旋律,作为与目标歌词信息对应的目标旋律。
通过上述技术方案,获取用于生成旋律的目标歌词信息,并将目标歌词信息输入至旋律生成模型中,得到旋律生成模型输出的旋律信息,再根据旋律信息,合成与目标歌词信息对应的目标旋律。其中,旋律生成模型是根据以目标歌词信息为条件的自动编码器训练得到的。由此,将用于生成旋律的目标歌词信息作为自动编码器的条件,有利于生成与目标歌词信息语义相对应的旋律,生成的目标旋律更加和谐。
可选地,在图1所示各步骤的基础上,本公开提供的方法还可以包括以下步骤:
根据目标歌词信息和目标旋律,合成目标歌曲。
基于目标歌词信息生成目标旋律后,可以进一步基于语音合成技术或歌曲合成技术,根据目标歌词信息和目标旋律生成目标歌曲。这样,不仅能够基于歌词获得符合该歌词的旋律,还能进一步基于生成的旋律合成歌曲,以获得符合歌词内容的整首歌曲。
图5是根据本公开的一种实施方式提供的旋律生成装置的框图。如图5所示,该装置50包括:
第一获取模块51,用于获取用于生成旋律的目标歌词信息;
旋律信息生成模块52,用于将所述目标歌词信息输入至旋律生成模型中,得到所述旋律生成模型输出的旋律信息,其中,所述旋律生成模型是根据以目标歌词信息为条件的自动编码器训练得到的;
旋律合成模块53,用于根据所述旋律信息,合成与所述目标歌词信息对应的目标旋律。
可选地,所述装置50通过如下模块获得所述旋律生成模型:
第二获取模块,用于获取多组训练数据,其中,每一组训练数据包括同一历史歌曲对应的历史歌词信息和历史旋律信息;
训练模块,用于根据所述训练数据,对所述自动编码器进行训练,得到训练完成的自动编码器,其中,所述自动编码器包括编码器层和解码器层;
模型生成模块,用于根据所述训练完成的自动编码器中的解码器层,得到所述旋律生成模型。
可选地,所述自动编码器为条件变分自动编码器;
所述训练模块用于:
将一组所述训练数据输入至本次训练所使用的条件变分自动编码器中的编码器层,获得所述编码器层针对输入的训练数据所确定的对应于隐空间的均值和方差;
根据所述均值、所述方差和附加噪声,生成对应于隐空间的重采样结果;
将所述重采样结果和条件信息输入至本次训练所使用的条件变分自动编码器中的解码器层,获得所述解码器层的输出结果,其中,所述条件信息为本次训练所使用的历史歌词信息;
在不满足模型停止训练条件的情况下,利用所述输出结果和本次训练所使用的历史旋律信息更新条件变分自动编码器,并将更新后的条件变分自动编码器用于下一次的训练中,直至满足所述模型停止训练条件,得到训练完成的自动编码器。
可选地,所述编码器层和所述解码器层中均包含transformer层。
可选地,所述训练模块用于利用生成式对抗网络和鉴别器对所述自动编码器中的解码器层进行训练,以得到训练完成的自动编码器。
可选地,所述旋律信息生成模块52包括:
获取子模块,用于获取目标噪声;
所述旋律信息生成模块52用于将所述目标歌词信息和所述目标噪声输入至所述旋律生成模型中,得到所述旋律生成模型输出的旋律信息。
可选地,所述装置50还包括:
歌曲合成模块,用于根据所述目标歌词信息和所述目标旋律,合成目标歌曲。
可选地,歌词信息包括单词序列以及与所述单词序列对应的音节序列;
旋律信息包括八度、音高、音长和停顿中的至少一种。
关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。
下面参考图6,其示出了适于用来实现本公开实施例的电子设备600的结构示意图。本公开实施例中的终端设备可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)等等的移动终端以及诸如数字TV、台式计算机等等的固定终端。图6示出的电子设备仅仅是一个示例,不应对本公开实施例的功能和使用范围带来任何限制。
如图6所示,电子设备600可以包括处理装置(例如中央处理器、图形处理器等)601,其可以根据存储在只读存储器(ROM)602中的程序或者从存储装置608加载到随机访问存储器(RAM)603中的程序而执行各种适当的动作和处理。在RAM 603中,还存储 有电子设备600操作所需的各种程序和数据。处理装置601、ROM 602以及RAM 603通过总线604彼此相连。输入/输出(I/O)接口605也连接至总线604。
通常,以下装置可以连接至I/O接口605:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置606;包括例如液晶显示器(LCD)、扬声器、振动器等的输出装置607;包括例如磁带、硬盘等的存储装置608;以及通信装置609。通信装置609可以允许电子设备600与其他设备进行无线或有线通信以交换数据。虽然图6示出了具有各种装置的电子设备600,但是应理解的是,并不要求实施或具备所有示出的装置。可以替代地实施或具备更多或更少的装置。
特别地,根据本公开的实施例,上文参考流程图描述的过程可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包含用于执行流程图所示的方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置609从网络上被下载和安装,或者从存储装置608被安装,或者从ROM 602被安装。在该计算机程序被处理装置601执行时,执行本公开实施例的方法中限定的上述功能。
需要说明的是,本公开上述的计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(射频)等等,或者上述的任意合适的组合。
在一些实施方式中,服务器可以利用诸如HTTP(HyperText Transfer Protocol,超文本传输协议)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或 介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(“LAN”),广域网(“WAN”),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。
上述计算机可读介质承载有一个或者多个程序,当上述一个或者多个程序被该电子设备执行时,使得该电子设备:获取用于生成旋律的目标歌词信息;将所述目标歌词信息输入至旋律生成模型中,得到所述旋律生成模型输出的旋律信息,其中,所述旋律生成模型是根据以目标歌词信息为条件的自动编码器训练得到的;根据所述旋律信息,合成与所述目标歌词信息对应的目标旋律。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言—诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言——诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络——包括局域网(LAN)或广域网(WAN)——连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的模块可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,模块的名称在某种情况下并不构成对该模块本身的限定,例如,第一获取模块还可以被描述为“获取用于生成旋律的目标歌词信息的模块”。
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非 限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上系统(SOC)、复杂可编程逻辑设备(CPLD)等等。
在本公开的上下文中,机器可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。机器可读介质可以是机器可读信号介质或机器可读储存介质。机器可读介质可以包括但不限于电子的、磁性的、光学的、电磁的、红外的、或半导体系统、装置或设备,或者上述内容的任何合适组合。机器可读存储介质的更具体示例会包括基于一个或多个线的电气连接、便携式计算机盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM或快闪存储器)、光纤、便捷式紧凑盘只读存储器(CD-ROM)、光学储存设备、磁储存设备、或上述内容的任何合适组合。
根据本公开的一个或多个实施例,提供了一种旋律生成方法,所述方法包括:
获取用于生成旋律的目标歌词信息;
将所述目标歌词信息输入至旋律生成模型中,得到所述旋律生成模型输出的旋律信息,其中,所述旋律生成模型是根据以目标歌词信息为条件的自动编码器训练得到的;
根据所述旋律信息,合成与所述目标歌词信息对应的目标旋律。
根据本公开的一个或多个实施例,提供了一种旋律生成方法,所述旋律生成模型通过如下方式获得:
获取多组训练数据,其中,每一组训练数据包括同一历史歌曲对应的历史歌词信息和历史旋律信息;
根据所述训练数据,对所述自动编码器进行训练,得到训练完成的自动编码器,其中,所述自动编码器包括编码器层和解码器层;
根据所述训练完成的自动编码器中的解码器层,得到所述旋律生成模型。
根据本公开的一个或多个实施例,提供了一种旋律生成方法,所述自动编码器为条件变分自动编码器;
所述根据所述训练数据,对所述自动编码器进行训练,得到训练完成的自动编码器,包括:
将一组所述训练数据输入至本次训练所使用的条件变分自动编码器中的编码器层,获得所述编码器层针对输入的训练数据所确定的对应于隐空间的均值和方差;
根据所述均值、所述方差和附加噪声,生成对应于隐空间的重采样结果;
将所述重采样结果和条件信息输入至本次训练所使用的条件变分自动编码器中的解 码器层,获得所述解码器层的输出结果,其中,所述条件信息为本次训练所使用的历史歌词信息;
在不满足模型停止训练条件的情况下,利用所述输出结果和本次训练所使用的历史旋律信息更新条件变分自动编码器,并将更新后的条件变分自动编码器用于下一次的训练中,直至满足所述模型停止训练条件,得到训练完成的自动编码器。
根据本公开的一个或多个实施例,提供了一种旋律生成方法,所述编码器层和所述解码器层中均包含transformer层。
根据本公开的一个或多个实施例,提供了一种旋律生成方法,所述根据所述训练数据,对所述自动编码器进行训练,得到训练完成的自动编码器,包括:
利用生成式对抗网络和鉴别器对所述自动编码器中的解码器层进行训练,以得到训练完成的自动编码器。
根据本公开的一个或多个实施例,提供了一种旋律生成方法,所述将所述目标歌词信息输入至旋律生成模型中,得到所述旋律生成模型输出的旋律信息,包括:
获取目标噪声;
将所述目标歌词信息和所述目标噪声输入至所述旋律生成模型中,得到所述旋律生成模型输出的旋律信息。
根据本公开的一个或多个实施例,提供了一种旋律生成方法,所述方法还包括:
根据所述目标歌词信息和所述目标旋律,合成目标歌曲。
根据本公开的一个或多个实施例,提供了一种旋律生成方法,歌词信息包括单词序列以及与所述单词序列对应的音节序列;
旋律信息包括八度、音高、音长和停顿中的至少一种。
根据本公开的一个或多个实施例,提供了一种旋律生成装置,所述装置包括:
第一获取模块,用于获取用于生成旋律的目标歌词信息;
旋律信息生成模块,用于将所述目标歌词信息输入至旋律生成模型中,得到所述旋律生成模型输出的旋律信息,其中,所述旋律生成模型是根据以目标歌词信息为条件的自动编码器训练得到的;
旋律合成模块,用于根据所述旋律信息,合成与所述目标歌词信息对应的目标旋律。
根据本公开的一个或多个实施例,提供了一种计算机可读介质,其上存储有计算机程序,该程序被处理装置执行时实现本公开任意实施例所述的旋律生成方法的步骤。
根据本公开的一个或多个实施例,提供了一种电子设备,包括:
存储装置,其上存储有计算机程序;
处理装置,用于执行所述存储装置中的所述计算机程序,以实现本公开任意实施例所述的旋律生成方法的步骤。
以上描述仅为本公开的较佳实施例以及对所运用技术原理的说明。本领域技术人员应当理解,本公开中所涉及的公开范围,并不限于上述技术特征的特定组合而成的技术方案,同时也应涵盖在不脱离上述公开构思的情况下,由上述技术特征或其等同特征进行任意组合而形成的其它技术方案。例如上述特征与本公开中公开的(但不限于)具有类似功能的技术特征进行互相替换而形成的技术方案。
此外,虽然采用特定次序描绘了各操作,但是这不应当理解为要求这些操作以所示出的特定次序或以顺序次序执行来执行。在一定环境下,多任务和并行处理可能是有利的。同样地,虽然在上面论述中包含了若干具体实现细节,但是这些不应当被解释为对本公开的范围的限制。在单独的实施例的上下文中描述的某些特征还可以组合地实现在单个实施例中。相反地,在单个实施例的上下文中描述的各种特征也可以单独地或以任何合适的子组合的方式实现在多个实施例中。
尽管已经采用特定于结构特征和/或方法逻辑动作的语言描述了本主题,但是应当理解所附权利要求书中所限定的主题未必局限于上面描述的特定特征或动作。相反,上面所描述的特定特征和动作仅仅是实现权利要求书的示例形式。关于上述实施例中的装置,其中各个模块执行操作的具体方式已经在有关该方法的实施例中进行了详细描述,此处将不做详细阐述说明。

Claims (12)

  1. A melody generation method, the method comprising:
    acquiring target lyric information for generating a melody;
    inputting the target lyric information into a melody generation model to obtain melody information output by the melody generation model, wherein the melody generation model is obtained by training an autoencoder conditioned on the target lyric information; and
    synthesizing, according to the melody information, a target melody corresponding to the target lyric information.
  2. The method according to claim 1, wherein the melody generation model is obtained in the following manner:
    acquiring multiple sets of training data, wherein each set of training data includes historical lyric information and historical melody information corresponding to the same historical song;
    training the autoencoder according to the training data to obtain a trained autoencoder, wherein the autoencoder includes an encoder layer and a decoder layer; and
    obtaining the melody generation model according to the decoder layer in the trained autoencoder.
  3. The method according to claim 2, wherein the autoencoder is a conditional variational autoencoder; and
    the training the autoencoder according to the training data to obtain a trained autoencoder comprises:
    inputting a set of the training data into the encoder layer of the conditional variational autoencoder used in this training, and obtaining a mean and a variance, corresponding to a latent space, determined by the encoder layer for the input training data;
    generating a resampling result corresponding to the latent space according to the mean, the variance and additional noise;
    inputting the resampling result and condition information into the decoder layer of the conditional variational autoencoder used in this training to obtain an output result of the decoder layer, wherein the condition information is the historical lyric information used in this training; and
    in a case where a condition for stopping model training is not met, updating the conditional variational autoencoder using the output result and the historical melody information used in this training, and using the updated conditional variational autoencoder in the next training, until the condition for stopping model training is met, to obtain the trained autoencoder.
  4. The method according to claim 3, wherein both the encoder layer and the decoder layer include a transformer layer.
  5. The method according to claim 2 or 3, wherein the training the autoencoder according to the training data to obtain a trained autoencoder comprises:
    training the decoder layer in the autoencoder using a generative adversarial network and a discriminator to obtain the trained autoencoder.
  6. The method according to claim 1, wherein the inputting the target lyric information into a melody generation model to obtain the melody information output by the melody generation model comprises:
    acquiring target noise; and
    inputting the target lyric information and the target noise into the melody generation model to obtain the melody information output by the melody generation model.
  7. The method according to claim 1, wherein the method further comprises:
    synthesizing a target song according to the target lyric information and the target melody.
  8. The method according to any one of claims 1-7, wherein
    the lyric information includes a word sequence and a syllable sequence corresponding to the word sequence; and
    the melody information includes at least one of octave, pitch, duration and pause.
  9. A melody generation apparatus, the apparatus comprising:
    a first acquisition module configured to acquire target lyric information for generating a melody;
    a melody information generation module configured to input the target lyric information into a melody generation model to obtain melody information output by the melody generation model, wherein the melody generation model is obtained by training an autoencoder conditioned on the target lyric information; and
    a melody synthesis module configured to synthesize, according to the melody information, a target melody corresponding to the target lyric information.
  10. A computer-readable medium on which a computer program is stored, wherein the program, when executed by a processing apparatus, implements the steps of the method according to any one of claims 1-8.
  11. An electronic device, comprising:
    a storage device on which a computer program is stored; and
    a processing device configured to execute the computer program in the storage device to implement the steps of the method according to any one of claims 1-8.
  12. A computer program product comprising a computer program, wherein the computer program, when executed by a processing apparatus, implements the steps of the method according to any one of claims 1-8.
PCT/CN2021/128322 2020-11-26 2021-11-03 旋律生成方法、装置、可读介质及电子设备 WO2022111242A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011349370.8A CN112489606B (zh) 2020-11-26 2020-11-26 旋律生成方法、装置、可读介质及电子设备
CN202011349370.8 2020-11-26

Publications (1)

Publication Number Publication Date
WO2022111242A1 true WO2022111242A1 (zh) 2022-06-02

Family

ID=74935194

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/128322 WO2022111242A1 (zh) 2020-11-26 2021-11-03 旋律生成方法、装置、可读介质及电子设备

Country Status (2)

Country Link
CN (1) CN112489606B (zh)
WO (1) WO2022111242A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489606B (zh) * 2020-11-26 2022-09-27 北京有竹居网络技术有限公司 旋律生成方法、装置、可读介质及电子设备
CN113066456B (zh) * 2021-03-17 2023-09-29 平安科技(深圳)有限公司 基于柏林噪声的旋律生成方法、装置、设备及存储介质
CN113035161A (zh) * 2021-03-17 2021-06-25 平安科技(深圳)有限公司 基于和弦的歌曲旋律生成方法、装置、设备及存储介质
CN113066459B (zh) * 2021-03-24 2023-05-30 平安科技(深圳)有限公司 基于旋律的歌曲信息合成方法、装置、设备及存储介质
CN112951187B (zh) * 2021-03-24 2023-11-03 平安科技(深圳)有限公司 梵呗音乐生成方法、装置、设备及存储介质

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180322854A1 (en) * 2017-05-08 2018-11-08 WaveAI Inc. Automated Melody Generation for Songwriting
CN108806656A (zh) * 2017-04-26 2018-11-13 微软技术许可有限责任公司 歌曲的自动生成
CN109166564A (zh) * 2018-07-19 2019-01-08 平安科技(深圳)有限公司 为歌词文本生成乐曲的方法、装置及计算机可读存储介质
CN110362696A (zh) * 2019-06-11 2019-10-22 平安科技(深圳)有限公司 歌词生成方法、系统、计算机设备及计算机可读存储介质
CN110853604A (zh) * 2019-10-30 2020-02-28 西安交通大学 基于变分自编码器的具有特定地域风格的中国民歌自动生成方法
CN111492424A (zh) * 2018-10-19 2020-08-04 索尼公司 信息处理设备、信息处理方法以及信息处理程序
CN111554267A (zh) * 2020-04-23 2020-08-18 北京字节跳动网络技术有限公司 音频合成方法、装置、电子设备和计算机可读介质
CN112489606A (zh) * 2020-11-26 2021-03-12 北京有竹居网络技术有限公司 旋律生成方法、装置、可读介质及电子设备

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190087734A1 (en) * 2016-03-28 2019-03-21 Sony Corporation Information processing apparatus and information processing method
CN108461079A (zh) * 2018-02-02 2018-08-28 福州大学 一种面向音色转换的歌声合成方法
CN110555126B (zh) * 2018-06-01 2023-06-27 微软技术许可有限责任公司 旋律的自动生成
CN111724809A (zh) * 2020-06-15 2020-09-29 苏州意能通信息技术有限公司 一种基于变分自编码器的声码器实现方法及装置
CN111739492B (zh) * 2020-06-18 2023-07-11 南京邮电大学 一种基于音高轮廓曲线的音乐旋律生成方法

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108806656A (zh) * 2017-04-26 2018-11-13 微软技术许可有限责任公司 歌曲的自动生成
US20180322854A1 (en) * 2017-05-08 2018-11-08 WaveAI Inc. Automated Melody Generation for Songwriting
CN109166564A (zh) * 2018-07-19 2019-01-08 平安科技(深圳)有限公司 为歌词文本生成乐曲的方法、装置及计算机可读存储介质
CN111492424A (zh) * 2018-10-19 2020-08-04 索尼公司 信息处理设备、信息处理方法以及信息处理程序
CN110362696A (zh) * 2019-06-11 2019-10-22 平安科技(深圳)有限公司 歌词生成方法、系统、计算机设备及计算机可读存储介质
CN110853604A (zh) * 2019-10-30 2020-02-28 西安交通大学 基于变分自编码器的具有特定地域风格的中国民歌自动生成方法
CN111554267A (zh) * 2020-04-23 2020-08-18 北京字节跳动网络技术有限公司 音频合成方法、装置、电子设备和计算机可读介质
CN112489606A (zh) * 2020-11-26 2021-03-12 北京有竹居网络技术有限公司 旋律生成方法、装置、可读介质及电子设备

Also Published As

Publication number Publication date
CN112489606A (zh) 2021-03-12
CN112489606B (zh) 2022-09-27

Similar Documents

Publication Publication Date Title
WO2022111242A1 (zh) 旋律生成方法、装置、可读介质及电子设备
WO2022105545A1 (zh) 语音合成方法、装置、可读介质及电子设备
WO2022151931A1 (zh) 语音合成方法、合成模型训练方法、装置、介质及设备
CN111583900B (zh) 歌曲合成方法、装置、可读介质及电子设备
CN111369967B (zh) 基于虚拟人物的语音合成方法、装置、介质及设备
CN111402842B (zh) 用于生成音频的方法、装置、设备和介质
WO2022105553A1 (zh) 语音合成方法、装置、可读介质及电子设备
WO2022151930A1 (zh) 语音合成方法、合成模型训练方法、装置、介质及设备
CN111798821B (zh) 声音转换方法、装置、可读存储介质及电子设备
WO2022143058A1 (zh) 语音识别方法、装置、存储介质及电子设备
CN111899720A (zh) 用于生成音频的方法、装置、设备和介质
CN111724807B (zh) 音频分离方法、装置、电子设备及计算机可读存储介质
WO2022156464A1 (zh) 语音合成方法、装置、可读介质及电子设备
CN111354343B (zh) 语音唤醒模型的生成方法、装置和电子设备
WO2022037388A1 (zh) 语音生成方法、装置、设备和计算机可读介质
CN111782576B (zh) 背景音乐的生成方法、装置、可读介质、电子设备
WO2022237665A1 (zh) 语音合成方法、装置、电子设备和存储介质
WO2022042418A1 (zh) 音乐合成方法、装置、设备和计算机可读介质
CN112562633A (zh) 一种歌唱合成方法、装置、电子设备及存储介质
CN111429881B (zh) 语音合成方法、装置、可读介质及电子设备
WO2023179506A1 (zh) 韵律预测方法、装置、可读介质及电子设备
Wang et al. Attention‐based neural network for end‐to‐end music separation
CN111653261A (zh) 语音合成方法、装置、可读存储介质及电子设备
Gabrielli et al. Deep learning for timbre modification and transfer: An evaluation study
CN112382273A (zh) 用于生成音频的方法、装置、设备和介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21896743

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21896743

Country of ref document: EP

Kind code of ref document: A1