CN109801608A - Neural-network-based song generation method and system - Google Patents
Neural-network-based song generation method and system
- Publication number: CN109801608A (application CN201811550908.4)
- Authority: CN (China)
- Prior art keywords: phoneme, audio, song, sample, model
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
- Classification: Auxiliary Devices For Music (AREA)
Abstract
A neural-network-based song generation method, comprising the following steps: obtaining a lyric text and determining a singer; extracting phonemes from the lyric text; predicting the duration and fundamental frequency of each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer; and combining the phonemes, durations, and fundamental frequencies to obtain the target song audio. The application also proposes a neural-network-based song generation system corresponding to the above method.
Description
Technical field
The present invention relates to the field of neural networks, and more particularly to a neural-network-based song generation method and system.
Background art
In game development and in the film and television industry, the demand for theme songs, end-credit songs, and similar works is growing daily, and an increasingly mature industrial chain has formed around them.
For game and film/television companies, most of the cost of a song is the performance fee of a well-known singer. Since most small and medium-sized enterprises cannot afford such high fees, they have to settle for second best and hire less well-known singers, whose professional standards cannot be guaranteed.
Therefore, how to reduce the production cost of theme songs, end-credit songs, and the like has become a problem that game and film/television companies must face.
Summary of the invention
The purpose of the application is to remedy the deficiencies in the prior art by providing a neural-network-based song generation method and system, thereby reducing song production cost and shortening the song production cycle.
To achieve the above goals, the application employs the following technical solution.
First, the application proposes a neural-network-based song generation method suitable for automatically generating a song from lyrics. The method includes the following steps:
S100) obtaining a lyric text and determining a singer;
S200) extracting phonemes from the lyric text;
S300) predicting the duration and fundamental frequency of each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer;
S400) combining the phonemes, durations, and fundamental frequencies to obtain the target song audio.
Further, in the above method of the application, step S100 includes the following sub-steps:
S101) obtaining a song generation page, the song generation page being used to set the lyric text and the singer;
S102) obtaining the lyric text from the song generation page;
S103) determining the singer selected on the song generation page.
Further, in the above method of the application, step S200 includes the following sub-steps:
S201) building a sample set from a standard phoneme dictionary and training a morpheme-to-phoneme conversion model, wherein the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-to-phoneme conversion model is a neural network model;
S202) converting the morphemes in the lyric text into the phonemes using the morpheme-to-phoneme conversion model.
Further, in the above method of the application, step S300 includes the following sub-steps:
S301) obtaining, from the audio file sample set corresponding to the singer, audio file samples annotated with text samples;
S302) extracting first phoneme samples from the text samples, and segmenting the audio file samples with a segmentation model to obtain the audio clip corresponding to each first phoneme sample;
S303) obtaining the duration and fundamental frequency of each first phoneme sample from its audio clip, so as to construct a training sample set;
S304) training the phoneme prediction model on the training sample set;
S305) predicting the duration and fundamental frequency of each phoneme in the lyric text with the trained phoneme prediction model.
Further, in the above method of the application, the audio clip of each first phoneme sample at least includes its start time in the audio file sample.
Further, in the above method of the application, step S400 includes the following sub-steps:
S401) obtaining a speech synthesis model, the speech synthesis model being a neural network model trained on a speech synthesis sample set, wherein each speech synthesis sample includes a second phoneme sample with duration and fundamental frequency information and a corresponding voice clip;
S402) taking the phonemes, durations, and fundamental frequencies as the input of the speech synthesis model, which then outputs the target song audio corresponding to the lyric text.
Further, the above method of the application further includes the steps of:
S500) obtaining an accompaniment audio;
S600) synthesizing the accompaniment audio with the target song audio to obtain a new target song audio.
Further, in the above method of the application, the phoneme prediction model is built on the Tensorflow and Keras learning frameworks.
Secondly, disclosed herein as well is a kind of songs neural network based to generate system, it is suitable for automatic according to the lyrics
Generate song.The system comprises the following modules: import modul, for obtaining lyrics text and determining singer;Phoneme extracts mould
Block, for extracting phoneme from the lyrics text;Phoneme prediction module, for predicting each sound according to phoneme prediction model
Element corresponding duration and fundamental frequency, wherein the phoneme prediction model is according to the corresponding audio file sample of the singer
The neural network model that the training of this set obtains;Binding modules are obtained for combining the phoneme, duration and fundamental frequency
Target song audio.
Further, in the above system of the application, the import module includes the following submodules: a page acquisition module for obtaining a song generation page, the song generation page being used to set the lyric text and the singer; a lyric acquisition module for obtaining the lyric text from the song generation page; and a singer determination module for determining the singer selected on the song generation page.
Further, in the above system of the application, the phoneme extraction module includes the following submodules: a morpheme-to-phoneme conversion model training module for building a sample set from a standard phoneme dictionary and training a morpheme-to-phoneme conversion model, wherein the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-to-phoneme conversion model is a neural network model; and a phoneme conversion module for converting the morphemes in the lyric text into the phonemes using the morpheme-to-phoneme conversion model.
Further, in the above system of the application, the phoneme prediction module includes the following submodules: an audio file sample acquisition module for obtaining, from the audio file sample set corresponding to the singer, audio file samples annotated with text samples; an audio clip acquisition module for extracting first phoneme samples from the text samples and segmenting the audio file samples with a segmentation model to obtain the audio clip corresponding to each first phoneme sample; a training sample set acquisition module for obtaining the duration and fundamental frequency of each first phoneme sample from its audio clip so as to construct a training sample set; a phoneme prediction model training module for training the phoneme prediction model on the training sample set; and a duration and pitch prediction module for predicting the duration and fundamental frequency of each phoneme in the lyric text with the trained phoneme prediction model.
Further, in the above system of the application, the audio clip of each first phoneme sample at least includes its start time in the audio file sample.
Further, in the above system of the application, the combination module includes the following submodules: a speech synthesis model acquisition module for obtaining a speech synthesis model, the speech synthesis model being a neural network model trained on a speech synthesis sample set, wherein each speech synthesis sample includes a second phoneme sample with duration and fundamental frequency information and a corresponding voice clip; and a target song audio output module for taking the phonemes, durations, and fundamental frequencies as the input of the speech synthesis model, which then outputs the target song audio corresponding to the lyric text.
Further, the above system of the application further includes:
an accompaniment audio acquisition module for obtaining an accompaniment audio; and
a synthesis module for synthesizing the accompaniment audio with the target song audio to obtain a new target song audio.
Further, in the above system of the application, the phoneme prediction model is built on the Tensorflow and Keras learning frameworks.
Finally, the application also proposes a computer-readable storage medium on which computer instructions are stored. When executed by a processor, the instructions perform the following steps:
S100) obtaining a lyric text and determining a singer;
S200) extracting phonemes from the lyric text;
S300) predicting the duration and fundamental frequency of each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer;
S400) combining the phonemes, durations, and fundamental frequencies to obtain the target song audio.
Further, when the processor executes the above instructions, step S100 includes the following sub-steps:
S101) obtaining a song generation page, the song generation page being used to set the lyric text and the singer;
S102) obtaining the lyric text from the song generation page;
S103) determining the singer selected on the song generation page.
Further, when the processor executes the above instructions, step S200 includes the following sub-steps:
S201) building a sample set from a standard phoneme dictionary and training a morpheme-to-phoneme conversion model, wherein the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-to-phoneme conversion model is a neural network model;
S202) converting the morphemes in the lyric text into the phonemes using the morpheme-to-phoneme conversion model.
Further, when the processor executes the above instructions, step S300 includes the following sub-steps:
S301) obtaining, from the audio file sample set corresponding to the singer, audio file samples annotated with text samples;
S302) extracting first phoneme samples from the text samples, and segmenting the audio file samples with a segmentation model to obtain the audio clip corresponding to each first phoneme sample;
S303) obtaining the duration and fundamental frequency of each first phoneme sample from its audio clip, so as to construct a training sample set;
S304) training the phoneme prediction model on the training sample set;
S305) predicting the duration and fundamental frequency of each phoneme in the lyric text with the trained phoneme prediction model.
Further, when the processor executes the above instructions, the audio clip of each first phoneme sample at least includes its start time in the audio file sample.
Further, when the processor executes the above instructions, step S400 includes the following sub-steps:
S401) obtaining a speech synthesis model, the speech synthesis model being a neural network model trained on a speech synthesis sample set, wherein each speech synthesis sample includes a second phoneme sample with duration and fundamental frequency information and a corresponding voice clip;
S402) taking the phonemes, durations, and fundamental frequencies as the input of the speech synthesis model, which then outputs the target song audio corresponding to the lyric text.
Further, when the processor executes the above instructions, the following steps are also performed:
S500) obtaining an accompaniment audio;
S600) synthesizing the accompaniment audio with the target song audio to obtain a new target song audio.
Further, when the processor executes the above instructions, the phoneme prediction model is built on the Tensorflow and Keras learning frameworks.
The beneficial effects of the application are as follows: a neural network predicts the duration and fundamental frequency of the phonemes extracted from the lyric text, and the phonemes, durations, and fundamental frequencies are then combined to obtain the target song audio, so that game and film/television companies no longer need to bear the high performance fees of singers, thereby reducing song production cost and shortening the song production cycle.
Brief description of the drawings
Fig. 1 shows the flow chart of the neural-network-based song generation method disclosed in the application;
Fig. 2 shows the flow chart of the lyric text and singer determination sub-method in one embodiment of the application;
Fig. 3 shows the flow chart of the phoneme extraction sub-method in another embodiment of the application;
Fig. 4 shows the flow chart of the phoneme prediction sub-method in another embodiment of the application;
Fig. 5 shows the flow chart of the target song audio generation sub-method in another embodiment of the application;
Fig. 6 shows the flow chart of another embodiment of the neural-network-based song generation method disclosed in the application;
Fig. 7 shows the structure diagram of the neural-network-based song generation system disclosed in the application.
Specific embodiments
The design of the application, its specific structure, and the technical effects produced are described clearly and completely below with reference to the embodiments and the accompanying drawings, so that the purpose, solutions, and effects of the application may be fully understood. It should be noted that, in the absence of conflict, the embodiments of the application and the features in the embodiments may be combined with each other.
It should be noted that, unless otherwise specified, when a feature is said to be "fixed" or "connected" to another feature, it may be directly fixed or connected to that feature, or fixed or connected to it indirectly. In addition, descriptions such as up, down, left, and right used in the application refer only to the mutual positions of the components of the application in the drawings. The singular forms "a", "an", "said", and "the" used in the application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
In addition, unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the art. The terms used in the description are intended only to describe specific embodiments and not to limit the application. The term "and/or" as used herein includes any combination of one or more of the associated listed items.
It will be appreciated that although the terms first, second, third, etc. may be used in the application to describe various elements, these elements should not be limited by these terms, which are only used to distinguish elements of the same type from one another. For example, without departing from the scope of the application, a first element could also be called a second element, and similarly a second element could be called a first element. Depending on the context, the word "if" as used herein may be construed as "when" or "upon".
Referring to the method flow chart shown in Fig. 1, in one or more embodiments the application proposes a neural-network-based song generation method suitable for automatically generating a song from lyrics. The method includes the following steps:
S100) obtaining a lyric text and determining a singer;
S200) extracting phonemes from the lyric text;
S300) predicting the duration and fundamental frequency of each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer;
S400) combining the phonemes, durations, and fundamental frequencies to obtain the target song audio.
Specifically, when creating a song, a game or film/television company only needs to input a lyric text and select a singer, and the target song audio can be generated automatically with the neural-network-based song generation method provided by the application. Further, a phoneme is the smallest unit of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes are divided into two major classes, vowels and consonants. Taking Chinese syllables as an example, the syllable ā has only one phoneme, ài (love) has two phonemes, dāi (slow-witted) has three phonemes, and so on. The speech corresponding to the lyric text can be determined by extracting phonemes from it, but since different sound sources (people or instruments) pronounce individual phonemes or phoneme combinations slightly differently, singing the lyric text in the voice of the determined singer requires that the extracted phonemes be further analyzed by the pre-trained phoneme prediction model corresponding to that singer, so as to determine the duration and fundamental frequency of each phoneme. Since combined phonemes vary in, for example, pitch or timbre, once the phoneme prediction model has determined the duration and fundamental frequency of each phoneme, the pitch or timbre of each phoneme can be determined; the phonemes, durations, and fundamental frequencies are then combined to obtain the target song audio, achieving the effect of simulating, with the deep learning method of a neural network, the singer singing the lyric text. Further, the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer. The samples in the audio file sample set are performance audio of the singer; training the phoneme prediction model on this sample set allows the model to better represent the singer's performance characteristics.
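The lyrics-to-audio flow described above can be sketched as a short pipeline. This is an illustrative reconstruction, not the patented implementation: the toy phoneme table, the constant-prediction stand-in for the trained model, and every function name below are assumptions made for the example.

```python
# Illustrative sketch of steps S100-S400: lyrics -> phonemes -> (duration, f0) -> audio.
# The lexicon, the constant "prediction", and all names are invented for illustration.

TOY_LEXICON = {"hey": ["HH", "EY"]}          # stand-in for the morpheme-phoneme dictionary

def extract_phonemes(lyric_text):            # S200: phoneme extraction
    phonemes = []
    for word in lyric_text.lower().split():
        phonemes.extend(TOY_LEXICON.get(word, []))
    return phonemes

def predict_duration_f0(phonemes, singer):   # S300: stand-in for the trained model
    # A real system would query the singer-specific neural network here.
    return [(p, 0.05, 140.0) for p in phonemes]

def combine(triples):                        # S400: stand-in for the synthesis model
    # Returns (phoneme, duration, f0) frames; a real system would render audio.
    return {"frames": triples, "total_duration": sum(d for _, d, _ in triples)}

def generate_song(lyric_text, singer):       # S100: entry point
    phonemes = extract_phonemes(lyric_text)
    triples = predict_duration_f0(phonemes, singer)
    return combine(triples)

song = generate_song("hey", "singer_a")
```

The point of the sketch is only the data flow: each stage consumes what the previous one produces, mirroring steps S100 through S400.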
As for the determination of the lyric text and singer, referring to the sub-method flow chart shown in Fig. 2, in one or more embodiments of the application it can be realized by the following sub-steps:
S101) obtaining a song generation page, the song generation page being used to set the lyric text and the singer;
S102) obtaining the lyric text from the song generation page;
S103) determining the singer selected on the song generation page.
In the above sub-method, the song generation page can be provided by a production website or a corresponding client, and the user can import or input the lyric text through it. The song generation page can also offer existing singers for the user to choose from, each of these singers having a correspondingly stored phoneme prediction model. The song generation page can further be provided with a button that, when clicked, triggers the song generation steps.
Further, referring to the sub-method flow chart shown in Fig. 3, in the above one or more embodiments of the application, step S200 includes the following sub-steps:
S201) building a sample set from a standard phoneme dictionary and training a morpheme-to-phoneme conversion model, wherein the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-to-phoneme conversion model is a neural network model;
S202) converting the morphemes in the lyric text into the phonemes using the morpheme-to-phoneme conversion model.
A morpheme is the smallest combination of sound and meaning, that is, the smallest meaningful linguistic unit. A morpheme is not a linguistic unit used independently; it mainly serves as the material from which words are built. Calling it a combination of sound and meaning, a meaningful linguistic unit, is intended to distinguish it from the syllable: some syllables have sound but no meaning and are not regarded as morphemes, as with Chinese words such as "thunderbolt" and "won ton", whose individual syllables are meaningless. Calling it the smallest meaningful linguistic unit, not a linguistic unit used independently, is intended to distinguish it from the word. By word formation, morphemes are divided into three kinds:
monosyllabic morphemes: a word formed from a single meaningful character;
disyllabic morphemes: a word formed from two characters that are meaningful only together;
polysyllabic morphemes: a word formed from more than two characters that are meaningful only together.
Further, key-value pairs of phonemes and morphemes are stored in the standard phoneme dictionary, from which the sample set for the morpheme-to-phoneme conversion model can be extracted; training on this sample set yields the morpheme-to-phoneme conversion model, which converts the morphemes in the lyric text into phonemes.
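The key-value structure described above can be pictured as a lookup table mapping morphemes to phoneme sequences. The sketch below uses a hand-made two-entry table (both entries invented for illustration) in place of the standard phoneme dictionary, and plain lookup in place of the trained neural conversion model, which would generalize to unseen morphemes:

```python
# Minimal morpheme -> phoneme conversion by dictionary lookup.
# The two entries below are invented; a trained conversion model would
# cover morphemes absent from the table instead of emitting "<unk>".

STANDARD_PHONEME_DICT = {
    "love": ["L", "AH", "V"],
    "you":  ["Y", "UW"],
}

def morphemes_to_phonemes(morphemes):
    """Concatenate the phoneme sequence of each known morpheme, in order."""
    result = []
    for m in morphemes:
        result.extend(STANDARD_PHONEME_DICT.get(m, ["<unk>"]))
    return result

phonemes = morphemes_to_phonemes(["love", "you"])
```

A dictionary like this doubles as the training sample set of step S201: each key-value pair is one (morpheme, phoneme sequence) training example.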
Further, referring to the sub-method flow chart shown in Fig. 4, in the above one or more embodiments of the application, step S300 includes the following sub-steps:
S301) obtaining, from the audio file sample set corresponding to the singer, audio file samples annotated with text samples;
S302) extracting first phoneme samples from the text samples, and segmenting the audio file samples with a segmentation model to obtain the audio clip corresponding to each first phoneme sample;
S303) obtaining the duration and fundamental frequency of each first phoneme sample from its audio clip, so as to construct a training sample set;
S304) training the phoneme prediction model on the training sample set;
S305) predicting the duration and fundamental frequency of each phoneme in the lyric text with the trained phoneme prediction model.
Specifically, performance audio of the singer is collected as audio file samples, and the content of each audio file sample is annotated with a text sample; that is, the audio file sample is a performance audio with lyrics. The audio file sample can be regarded as a set of first phoneme samples. Further, the segmentation model is also a neural network model; it can match the context in which each phoneme is voiced, so as to obtain the corresponding audio segment and its position in the audio. Specifically, the audio file sample can be split according to the first phoneme samples to obtain the audio clip corresponding to each first phoneme sample. In one or more embodiments of the application, the audio clip of each first phoneme sample at least includes its start time in the audio file sample. Further, the duration and fundamental frequency of each first phoneme sample can be obtained from its audio clip to construct the training sample set on which the phoneme prediction model is trained, and the duration and fundamental frequency of each phoneme in the lyric text are then predicted with the trained phoneme prediction model.
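Step S303 needs a duration and a fundamental frequency per clip. The duration is simply the clip length divided by the sample rate; one classical way to estimate the fundamental frequency, used here only as a stand-in for whatever the patent's training pipeline actually does, is autocorrelation:

```python
import numpy as np

def clip_duration_and_f0(clip, sr, f_min=60.0, f_max=500.0):
    """Return (duration_s, f0_hz) for a mono clip via autocorrelation pitch detection."""
    duration = len(clip) / sr
    x = clip - clip.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]   # autocorrelation at lags 0..N-1
    lo, hi = int(sr / f_max), int(sr / f_min)           # lags for plausible pitches
    lag = lo + int(np.argmax(ac[lo:hi]))                # strongest periodicity
    return duration, sr / lag

# Synthetic check: a 140 Hz sine, 0.2 s at 16 kHz.
sr = 16000
t = np.arange(int(0.2 * sr)) / sr
dur, f0 = clip_duration_and_f0(np.sin(2 * np.pi * 140.0 * t), sr)
```

The (duration, f0) pairs produced this way, keyed by phoneme, form exactly the training sample set that step S304 consumes.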
Further, referring to the sub-method flow chart shown in Fig. 5, in the above one or more embodiments of the application, step S400 includes the following sub-steps:
S401) obtaining a speech synthesis model, the speech synthesis model being a neural network model trained on a speech synthesis sample set, wherein each speech synthesis sample includes a second phoneme sample with duration and fundamental frequency information and a corresponding voice clip;
S402) taking the phonemes, durations, and fundamental frequencies as the input of the speech synthesis model, which then outputs the target song audio corresponding to the lyric text.
Specifically, during training, the input of the speech synthesis model is second phoneme samples with duration and fundamental frequency information, of the form [(HH, 0.05s, 140hz), (EH, 0.07s, 141hz) ...], and the label is the voice clip corresponding to the second phoneme sample. The trained speech synthesis model can then process the input phonemes, durations, and fundamental frequencies and output the target song audio corresponding to the lyric text.
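Before triples of that form can be fed to a neural network, each (phoneme, duration, fundamental frequency) tuple has to become a numeric vector. One common choice, assumed here since the patent does not specify the encoding, is a one-hot phoneme id concatenated with the two scalar features:

```python
import numpy as np

# Toy phoneme inventory; a real system would enumerate the full phoneme set.
PHONEMES = ["HH", "EH", "L", "OW"]

def encode_triples(triples):
    """Encode [(phoneme, duration_s, f0_hz), ...] as a (T, len(PHONEMES)+2) matrix."""
    rows = []
    for ph, dur, f0 in triples:
        onehot = np.zeros(len(PHONEMES))
        onehot[PHONEMES.index(ph)] = 1.0          # one-hot phoneme identity
        rows.append(np.concatenate([onehot, [dur, f0]]))
    return np.stack(rows)

X = encode_triples([("HH", 0.05, 140.0), ("EH", 0.07, 141.0)])
```

In practice the scalar columns would usually be normalized (e.g. log-F0, z-scored durations) before training, a detail omitted here for brevity.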
Further, referring to the method flow chart shown in Fig. 6, in the above one or more embodiments of the application, the method further includes the steps of:
S500) obtaining an accompaniment audio;
S600) synthesizing the accompaniment audio with the target song audio to obtain a new target song audio.
Specifically, the synthesis can be a mixdown of the accompaniment audio and the target song audio. Mixing is the processing of multi-track sound material produced by recording, sampling, synthesis, and the like: the tracks are balanced and adjusted and mixed into a finished multi-channel product. After synthesis, the new target song audio has an accompaniment.
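At the sample level, the mixdown of step S600 reduces to a gain-weighted sum of the vocal and accompaniment tracks followed by clipping. The 0.8/0.5 gains below are arbitrary illustrative values, not figures from the patent:

```python
import numpy as np

def mix(vocal, accompaniment, vocal_gain=0.8, acc_gain=0.5):
    """Mix two equal-rate mono tracks; pad the shorter with silence, clip to [-1, 1]."""
    n = max(len(vocal), len(accompaniment))
    out = np.zeros(n)
    out[:len(vocal)] += vocal_gain * vocal
    out[:len(accompaniment)] += acc_gain * accompaniment
    return np.clip(out, -1.0, 1.0)   # keep the sum within valid sample range

mixed = mix(np.array([0.5, 0.5]), np.array([1.0, 1.0, 1.0]))
```

The balancing "adjustment" the text mentions corresponds to choosing the two gains; a production mixdown would also handle multi-channel tracks and loudness normalization.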
In one or more embodiments of the application, the phoneme prediction model is built on the Tensorflow and Keras learning frameworks. Specifically, machine training on the audio file sample set is carried out using the machine learning and deep learning computations of Tensorflow, combined with the convolutional neural networks and activation functions of the Keras API, so as to realize the deep learning that predicts the duration and fundamental frequency corresponding to each phoneme. The morpheme-to-phoneme conversion model, segmentation model, and speech synthesis model referred to in the application may likewise be built on the Tensorflow and Keras learning frameworks. Those skilled in the art can build and train corresponding classifiers using existing neural network training methods, and the application places no specific restriction on this.
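As a framework-agnostic illustration of the kind of model this paragraph describes — not the patent's actual Tensorflow/Keras architecture, whose layers and weights are unspecified — a single hidden-layer network mapping encoded phoneme features to a (duration, fundamental frequency) pair can be written as a plain forward pass:

```python
import numpy as np

rng = np.random.default_rng(0)

# One dense hidden layer with ReLU, then a linear output of (duration, f0).
# Untrained random weights stand in for the trained model; only shapes matter here.
N_FEATURES, N_HIDDEN = 6, 16
W1, b1 = rng.normal(size=(N_FEATURES, N_HIDDEN)), np.zeros(N_HIDDEN)
W2, b2 = rng.normal(size=(N_HIDDEN, 2)), np.zeros(2)

def predict(x):
    """Forward pass: x is (T, N_FEATURES); returns (T, 2) of (duration, f0) guesses."""
    h = np.maximum(x @ W1 + b1, 0.0)   # ReLU activation
    return h @ W2 + b2

y = predict(rng.normal(size=(3, N_FEATURES)))
```

In the Keras formulation the same structure would be two Dense layers trained with a regression loss; the patent leaves that choice open.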
Referring to the structure diagram shown in Fig. 7, in one or more embodiments the application also discloses a neural-network-based song generation system suitable for automatically generating a song from lyrics. The system comprises the following modules: an import module for obtaining a lyric text and determining a singer; a phoneme extraction module for extracting phonemes from the lyric text; a phoneme prediction module for predicting the duration and fundamental frequency of each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer; and a combination module for combining the phonemes, durations, and fundamental frequencies to obtain the target song audio. Specifically, when creating a song, a game or film/television company only needs to input a lyric text and select a singer, and the target song audio can be generated automatically with the neural-network-based song generation method provided by the application. Further, a phoneme is the smallest unit of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes are divided into two major classes, vowels and consonants. Taking Chinese syllables as an example, the syllable ā has only one phoneme, ài (love) has two phonemes, dāi (slow-witted) has three phonemes, and so on. The speech corresponding to the lyric text can be determined by extracting phonemes from it, but since different sound sources (people or instruments) pronounce individual phonemes or phoneme combinations slightly differently, singing the lyric text in the voice of the determined singer requires that the extracted phonemes be further analyzed by the pre-trained phoneme prediction model corresponding to that singer, so as to determine the duration and fundamental frequency of each phoneme. Since combined phonemes vary in, for example, pitch or timbre, once the phoneme prediction model has determined the duration and fundamental frequency of each phoneme, the pitch or timbre of each phoneme can be determined; the phonemes, durations, and fundamental frequencies are then combined to obtain the target song audio, achieving the effect of simulating, with the deep learning method of a neural network, the singer singing the lyric text. Further, the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer. The samples in the audio file sample set are performance audio of the singer; training the phoneme prediction model on this sample set allows the model to better represent the singer's performance characteristics.
Further, in the above one or more embodiments of the application, the import module includes the following submodules: a page acquisition module for obtaining a song generation page, the song generation page being used to set the lyric text and the singer; a lyric acquisition module for obtaining the lyric text from the song generation page; and a singer determination module for determining the singer selected on the song generation page. Specifically, the song generation page can be provided by a production website or a corresponding client, and the user can import or input the lyric text through it. The song generation page can also offer existing singers for the user to choose from, each of these singers having a correspondingly stored phoneme prediction model. The song generation page can further be provided with a button that, when clicked, triggers the song generation steps.
Further, in one or more embodiments of the application, the phoneme extraction module further includes the following submodules: a morpheme-phoneme conversion model training module, for constructing a sample set from a standard phoneme dictionary and training a morpheme-phoneme conversion model, where the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-phoneme conversion model is a neural network model; and a phoneme conversion module, for converting the morphemes in the lyrics text into the phonemes using the morpheme-phoneme conversion model. Here, a morpheme is the smallest unit combining sound and meaning, that is, the smallest meaningful linguistic unit. A morpheme is not a linguistic unit used independently; it mainly serves as material for forming words. Calling it a combination of sound and meaning distinguishes it from a syllable: some syllables carry sound but no meaning of their own and are therefore not morphemes, for example the individual syllables of the Chinese words for "thunderbolt" and "wonton". Calling it the smallest meaningful linguistic unit distinguishes it from a word, which is used independently. By word formation, morphemes fall into three kinds: monosyllabic morphemes, formed from a single character that is meaningful on its own; disyllabic morphemes, formed from two characters that are meaningful only together; and polysyllabic morphemes, formed from more than two characters that are meaningful only together. Further, the standard phoneme dictionary stores key-value pairs of phonemes and morphemes, from which the sample set for the morpheme-phoneme conversion model can be extracted; training on this sample set yields the morpheme-phoneme conversion model, which is used to convert the morphemes in the lyrics text into phonemes.
Further, in one or more embodiments of the application, the phoneme prediction module further includes the following submodules: an audio file sample acquisition module, for obtaining, from the singer's corresponding audio file sample set, audio file samples labeled with text samples; an audio clip acquisition module, for extracting first phoneme samples from the text samples and decomposing each audio file sample, according to a segmentation model, into the audio clip corresponding to each first phoneme sample; a training sample set acquisition module, for obtaining each first phoneme sample's duration and fundamental frequency from its audio clip so as to construct a training sample set; a phoneme prediction model training module, for training the phoneme prediction model on the training sample set; and a duration and pitch prediction module, for predicting, with the trained phoneme prediction model, the duration and fundamental frequency of each phoneme in the lyrics text. Specifically, the singer's performance audio is collected as audio file samples, and the sung content of each audio file sample is labeled with a text sample; that is, each audio file sample is performance audio with lyrics. An audio file sample can be regarded as a set of first phoneme samples. Further, the segmentation model is also a neural network model; it can match the context in which each phoneme is voiced, yielding the corresponding audio segment and its position within the audio. Specifically, the audio file sample can be split according to the first phoneme samples to obtain the audio clip corresponding to each first phoneme sample. In one or more embodiments of the application, the audio clip of each first phoneme sample at least includes its start time in the audio file sample. Further, each first phoneme sample's duration and fundamental frequency can be obtained from its audio clip, so as to construct the training sample set and train the phoneme prediction model, which then predicts the duration and fundamental frequency of each phoneme in the lyrics text.
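Assuming the segmentation model yields, for each first phoneme sample, a clip with start and end times inside the audio file, the construction of (phoneme, duration, F0) training tuples might look like the sketch below. The `estimate_f0` stub stands in for a real pitch tracker (e.g. an autocorrelation-based estimator), which the patent does not specify:

```python
from typing import Callable, List, Tuple

def build_training_set(
    segments: List[Tuple[str, float, float]],          # (phoneme, start_s, end_s)
    estimate_f0: Callable[[Tuple[float, float]], float],  # clip -> mean F0 in Hz
) -> List[Tuple[str, float, float]]:
    """Turn aligned audio segments into (phoneme, duration, F0) samples."""
    samples = []
    for phoneme, start, end in segments:
        duration = round(end - start, 3)   # duration from the alignment times
        f0 = estimate_f0((start, end))     # pitch extracted from the audio clip
        samples.append((phoneme, duration, f0))
    return samples

# Stub pitch estimator returning a constant (a real system analyzes the audio).
samples = build_training_set([("HH", 0.00, 0.05), ("EH", 0.05, 0.12)],
                             lambda clip: 140.0)
print(samples)  # → [('HH', 0.05, 140.0), ('EH', 0.07, 140.0)]
```

The resulting tuples form the training sample set on which the phoneme prediction model is fitted.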
Further, in one or more embodiments of the application, the combination module further includes the following submodules: a speech synthesis model acquisition module, for obtaining a speech synthesis model, which is a neural network model trained on a speech synthesis sample set whose samples each comprise a second phoneme sample carrying duration and fundamental frequency information together with the corresponding speech clip; and a target song audio output module, for feeding the phonemes, durations, and fundamental frequencies into the speech synthesis model and outputting, after processing by the speech synthesis model, the target song audio corresponding to the lyrics text. Specifically, during training, the input of the speech synthesis model is a second phoneme sample with duration and fundamental frequency information, shaped like [(HH, 0.05 s, 140 Hz), (EH, 0.07 s, 141 Hz), ...], and the label is the speech clip corresponding to that second phoneme sample. The trained speech synthesis model can then process the input phonemes, durations, and fundamental frequencies and output the target song audio corresponding to the lyrics text.
Further, one or more embodiments of the application further include:
an accompaniment audio acquisition module, for obtaining accompaniment audio;
a synthesis module, for synthesizing the accompaniment audio with the target song audio to obtain new target song audio. Specifically, the synthesis may mix the accompaniment audio with the target song audio. Mixing is a form of processing applied to multi-track sound material produced by recording, sampling, synthesis, or similar means: the tracks are balanced and adjusted, then mixed down into a multi-channel finished product. After synthesis, the new target song audio carries the accompaniment.
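In the simplest case, mixing can be a sample-wise weighted sum of the two tracks followed by clamping to the valid range. The sketch below assumes equal-length mono tracks and invented gain values; real mixing, as the paragraph notes, also involves balancing and adjustment across multiple tracks and channels:

```python
def mix_tracks(vocals, accompaniment, vocal_gain=1.0, accomp_gain=0.5):
    """Mix two equal-length mono tracks (float samples in [-1, 1])."""
    mixed = [vocal_gain * v + accomp_gain * a
             for v, a in zip(vocals, accompaniment)]
    # Clamp to the valid sample range to avoid overflow on loud passages.
    return [max(-1.0, min(1.0, s)) for s in mixed]

# Target song audio mixed with accompaniment to form the new target song audio.
song = mix_tracks([0.5, -0.2, 0.9], [0.3, 0.3, 0.3])
print(song)
```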
In one or more embodiments of the application, the phoneme prediction model is built on the TensorFlow and Keras learning frameworks. Specifically, machine training is performed on the audio file sample set through TensorFlow's machine learning and deep learning computation, combined with convolutional neural networks and activation functions from the Keras API, so as to learn to predict each phoneme's duration and fundamental frequency. The morpheme-phoneme conversion model, segmentation model, and speech synthesis model referred to in the application may likewise be built on the TensorFlow and Keras learning frameworks. Those skilled in the art can build and train the corresponding classifiers using existing neural network training approaches; the application places no specific restriction on this.
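One plausible reading of such a TensorFlow/Keras model — this is a sketch under assumptions, not the patent's actual architecture — maps a phoneme-ID sequence to a per-phoneme duration and fundamental frequency via an embedding, a 1-D convolution, and two regression heads (the vocabulary size and layer widths are invented):

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import Model, layers

NUM_PHONEMES = 50  # assumed phoneme inventory size, for illustration

# Phoneme-ID sequence in; per-phoneme duration and F0 out.
inp = tf.keras.Input(shape=(None,), dtype="int32")
x = layers.Embedding(NUM_PHONEMES, 32)(inp)
# The patent mentions convolutional networks built with the Keras API;
# a 1-D convolution over the phoneme sequence is one possible reading.
x = layers.Conv1D(64, kernel_size=3, padding="same", activation="relu")(x)
# Softplus keeps predicted durations and frequencies non-negative.
duration = layers.Dense(1, activation="softplus", name="duration")(x)
f0 = layers.Dense(1, activation="softplus", name="f0")(x)

model = Model(inp, [duration, f0])
model.compile(optimizer="adam", loss="mse")

# Untrained forward pass on a three-phoneme sequence, to show the shapes.
pred = model.predict(np.array([[1, 2, 3]]), verbose=0)
```

Training would call `model.fit` with phoneme-ID sequences as inputs and the (duration, F0) values from the training sample set as targets.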
It should be appreciated that embodiments herein may be effected or carried out by computer hardware, by a combination of hardware and software, or by computer instructions stored in a non-transitory computer-readable memory. The methods may be implemented using standard programming techniques, including a non-transitory computer-readable storage medium configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner according to the methods and figures described in the particular embodiments. Each program may be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system; if desired, a program may instead be implemented in assembly or machine language. In any case, the language may be compiled or interpreted. Furthermore, a program may run on an application-specific integrated circuit programmed for that purpose.
Further, the methods may be practiced on any type of suitable computing platform operably coupled thereto, including but not limited to a personal computer, minicomputer, mainframe, workstation, networked or distributed computing environment, or a separate or integrated computer platform, or one in communication with a charged-particle tool or other imaging device, and the like. Aspects of the application may be implemented as machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into the computing platform, such as a hard disk, optically read and/or written storage media, RAM, or ROM, such that it is readable by a programmable computer and, when the storage medium or device is read by the computer, configures and operates the computer to perform the processes described herein. Further, the machine-readable code, or portions thereof, may be transmitted over a wired or wireless network. The application described herein includes these and other various types of non-transitory computer-readable storage media when such media contain instructions or programs that, in conjunction with a microprocessor or other data processor, implement the steps above. When programmed according to the methods and techniques described herein, the application also encompasses the computer itself.
A computer program can be applied to input data to perform the functions described herein, thereby transforming the input data and generating output data stored to non-volatile memory. The output information may also be applied to one or more output devices such as a display. In preferred embodiments of the application, the transformed data represents physical and tangible objects, including particular visual depictions of physical and tangible objects produced on the display.
The specification and drawings are therefore to be regarded in an illustrative rather than a restrictive sense. It will, however, be evident that various modifications and changes may be made to the application without departing from its broader spirit and scope as set forth in the claims.
Other variations are within the spirit of the application. Thus, while the disclosed techniques are susceptible to various modifications and alternative constructions, certain illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to limit the application to the specific form or forms disclosed; on the contrary, the intention is to cover all modifications, alternative constructions, and equivalents falling within the spirit and scope as defined in the appended claims.
Claims (10)
1. A song generation method based on a neural network, characterized by comprising the following steps:
S100) obtaining a lyrics text and determining a singer;
S200) extracting phonemes from the lyrics text;
S300) predicting each phoneme's duration and fundamental frequency according to a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer;
S400) combining the phonemes, durations, and fundamental frequencies to obtain the target song audio.
2. The method according to claim 1, characterized in that step S100 further includes the following sub-steps:
S101) obtaining a song generation page for setting the lyrics text and the singer;
S102) obtaining the lyrics text from the song generation page;
S103) determining the singer selected on the song generation page.
3. The method according to claim 1, characterized in that step S200 further includes the following sub-steps:
S201) constructing a sample set from a standard phoneme dictionary and training a morpheme-phoneme conversion model, wherein the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-phoneme conversion model is a neural network model;
S202) converting the morphemes in the lyrics text into the phonemes using the morpheme-phoneme conversion model.
4. The method according to claim 1, characterized in that step S300 further includes the following sub-steps:
S301) obtaining, from the audio file sample set corresponding to the singer, an audio file sample labeled with a text sample;
S302) extracting first phoneme samples from the text sample, and segmenting the audio file sample according to a segmentation model to obtain the audio clip corresponding to each first phoneme sample;
S303) obtaining each first phoneme sample's duration and fundamental frequency from the audio clips, so as to construct a training sample set;
S304) training the phoneme prediction model on the training sample set;
S305) predicting, with the trained phoneme prediction model, the duration and fundamental frequency of each phoneme in the lyrics text.
5. The method according to claim 4, characterized in that the audio clip of each first phoneme sample at least includes its start time in the audio file sample.
6. The method according to claim 1, characterized in that step S400 further includes the following sub-steps:
S401) obtaining a speech synthesis model, the speech synthesis model being a neural network model trained on a speech synthesis sample set, each speech synthesis sample comprising a second phoneme sample with duration and fundamental frequency information and the corresponding speech clip;
S402) feeding the phonemes, durations, and fundamental frequencies into the speech synthesis model, so as to output, after processing by the speech synthesis model, the target song audio corresponding to the lyrics text.
7. The method according to any one of claims 1 to 6, characterized by further comprising:
S500) obtaining accompaniment audio;
S600) synthesizing the accompaniment audio with the target song audio to obtain new target song audio.
8. The method according to claim 1, characterized in that the phoneme prediction model is built on the TensorFlow and Keras learning frameworks.
9. A song generation system based on a neural network, suitable for the production of two-dimensional electronic action games, characterized by comprising the following modules:
an import module, for obtaining a lyrics text and determining a singer;
a phoneme extraction module, for extracting phonemes from the lyrics text;
a phoneme prediction module, for predicting each phoneme's duration and fundamental frequency according to a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer;
a combination module, for combining the phonemes, durations, and fundamental frequencies to obtain the target song audio.
10. A computer-readable storage medium on which computer instructions are stored, characterized in that the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811550908.4A CN109801608A (en) | 2018-12-18 | 2018-12-18 | A kind of song generation method neural network based and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109801608A true CN109801608A (en) | 2019-05-24 |
Family
ID=66557198
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811550908.4A Pending CN109801608A (en) | 2018-12-18 | 2018-12-18 | A kind of song generation method neural network based and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109801608A (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106652995A (en) * | 2016-12-31 | 2017-05-10 | 深圳市优必选科技有限公司 | Voice broadcasting method and system for text |
CN106898340A (en) * | 2017-03-30 | 2017-06-27 | 腾讯音乐娱乐(深圳)有限公司 | The synthetic method and terminal of a kind of song |
CN106971703A (en) * | 2017-03-17 | 2017-07-21 | 西北师范大学 | A kind of song synthetic method and device based on HMM |
US9865251B2 (en) * | 2015-07-21 | 2018-01-09 | Asustek Computer Inc. | Text-to-speech method and multi-lingual speech synthesizer using the method |
CN107833572A (en) * | 2017-11-06 | 2018-03-23 | 芋头科技(杭州)有限公司 | The phoneme synthesizing method and system that a kind of analog subscriber is spoken |
CN108510975A (en) * | 2017-02-24 | 2018-09-07 | 百度(美国)有限责任公司 | System and method for real-time neural text-to-speech |
CN108806665A (en) * | 2018-09-12 | 2018-11-13 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method and device |
US20180336880A1 (en) * | 2017-05-19 | 2018-11-22 | Baidu Usa Llc | Systems and methods for multi-speaker neural text-to-speech |
- 2018-12-18: CN application CN201811550908.4A filed, publication CN109801608A, status Pending
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9865251B2 (en) * | 2015-07-21 | 2018-01-09 | Asustek Computer Inc. | Text-to-speech method and multi-lingual speech synthesizer using the method |
CN106652995A (en) * | 2016-12-31 | 2017-05-10 | 深圳市优必选科技有限公司 | Voice broadcasting method and system for text |
CN108510975A (en) * | 2017-02-24 | 2018-09-07 | 百度(美国)有限责任公司 | System and method for real-time neural text-to-speech |
CN106971703A (en) * | 2017-03-17 | 2017-07-21 | 西北师范大学 | A kind of song synthetic method and device based on HMM |
CN106898340A (en) * | 2017-03-30 | 2017-06-27 | 腾讯音乐娱乐(深圳)有限公司 | The synthetic method and terminal of a kind of song |
US20180336880A1 (en) * | 2017-05-19 | 2018-11-22 | Baidu Usa Llc | Systems and methods for multi-speaker neural text-to-speech |
CN107833572A (en) * | 2017-11-06 | 2018-03-23 | 芋头科技(杭州)有限公司 | The phoneme synthesizing method and system that a kind of analog subscriber is spoken |
CN108806665A (en) * | 2018-09-12 | 2018-11-13 | 百度在线网络技术(北京)有限公司 | Phoneme synthesizing method and device |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112802446A (en) * | 2019-11-14 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Audio synthesis method and device, electronic equipment and computer-readable storage medium |
CN112802446B (en) * | 2019-11-14 | 2024-05-07 | 腾讯科技(深圳)有限公司 | Audio synthesis method and device, electronic equipment and computer readable storage medium |
CN111429877A (en) * | 2020-03-03 | 2020-07-17 | 云知声智能科技股份有限公司 | Song processing method and device |
CN111477210A (en) * | 2020-04-02 | 2020-07-31 | 北京字节跳动网络技术有限公司 | Speech synthesis method and device |
WO2021218324A1 (en) * | 2020-04-27 | 2021-11-04 | 北京字节跳动网络技术有限公司 | Song synthesis method, device, readable medium, and electronic apparatus |
CN112037757A (en) * | 2020-09-04 | 2020-12-04 | 腾讯音乐娱乐科技(深圳)有限公司 | Singing voice synthesis method and device and computer readable storage medium |
CN112037757B (en) * | 2020-09-04 | 2024-03-15 | 腾讯音乐娱乐科技(深圳)有限公司 | Singing voice synthesizing method, singing voice synthesizing equipment and computer readable storage medium |
CN112071299A (en) * | 2020-09-09 | 2020-12-11 | 腾讯音乐娱乐科技(深圳)有限公司 | Neural network model training method, audio generation method and device and electronic equipment |
CN112164387A (en) * | 2020-09-22 | 2021-01-01 | 腾讯音乐娱乐科技(深圳)有限公司 | Audio synthesis method and device, electronic equipment and computer-readable storage medium |
CN112331222A (en) * | 2020-09-23 | 2021-02-05 | 北京捷通华声科技股份有限公司 | Method, system, equipment and storage medium for converting song tone |
CN112750422A (en) * | 2020-12-23 | 2021-05-04 | 出门问问(苏州)信息科技有限公司 | Singing voice synthesis method, device and equipment |
CN112750421A (en) * | 2020-12-23 | 2021-05-04 | 出门问问(苏州)信息科技有限公司 | Singing voice synthesis method and device and readable storage medium |
CN112750421B (en) * | 2020-12-23 | 2022-12-30 | 出门问问(苏州)信息科技有限公司 | Singing voice synthesis method and device and readable storage medium |
CN112750422B (en) * | 2020-12-23 | 2023-01-31 | 出门问问创新科技有限公司 | Singing voice synthesis method, device and equipment |
CN112698757A (en) * | 2020-12-25 | 2021-04-23 | 北京小米移动软件有限公司 | Interface interaction method and device, terminal equipment and storage medium |
CN112786013A (en) * | 2021-01-11 | 2021-05-11 | 北京有竹居网络技术有限公司 | Voice synthesis method and device based on album, readable medium and electronic equipment |
CN112906369A (en) * | 2021-02-19 | 2021-06-04 | 脸萌有限公司 | Lyric file generation method and device |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20190524 |