CN109801608A - Neural-network-based song generation method and system - Google Patents

Neural-network-based song generation method and system

Info

Publication number
CN109801608A
Authority
CN
China
Prior art keywords
phoneme
audio
song
sample
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811550908.4A
Other languages
Chinese (zh)
Inventor
周湘君
杜庆焜
陈海荣
张李京
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Xishan Yichuang Culture Co Ltd
Original Assignee
Wuhan Xishan Yichuang Culture Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Xishan Yichuang Culture Co Ltd
Priority to CN201811550908.4A
Publication of CN109801608A
Legal status: Pending

Landscapes

  • Auxiliary Devices For Music (AREA)

Abstract

A neural-network-based song generation method, comprising the following steps: obtaining a lyrics text and determining a singer; extracting phonemes from the lyrics text; predicting the duration and fundamental frequency corresponding to each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer; and combining the phonemes, durations and fundamental frequencies to obtain the target song audio. The application also proposes a neural-network-based song generation system corresponding to the above method.

Description

Neural-network-based song generation method and system
Technical field
The present invention relates to the field of neural networks, and more particularly to a neural-network-based song generation method and system.
Background technique
In the fields of game development and film and television, demand for songs such as theme songs and ending themes is growing by the day, and an increasingly mature industrial chain has taken shape.
For game development and film and television companies, most of the cost of a song is the performance fee of a well-known singer. Since most small and medium-sized enterprises cannot afford such high performance fees, they have to settle for second best and hire singers of lesser renown, whose professional standards cannot be guaranteed.
Therefore, how to reduce the production cost of theme songs, ending themes and similar songs has become a problem that game development and film and television companies must face.
Summary of the invention
The purpose of the application is to remedy the deficiencies of the prior art by providing a neural-network-based song generation method and system, thereby reducing song production cost and shortening the song production cycle.
To achieve the above goals, the application employs the following technical solution.
First, the application proposes a neural-network-based song generation method, suitable for automatically generating a song from lyrics. The method includes the following steps:
S100) obtain a lyrics text and determine a singer;
S200) extract phonemes from the lyrics text;
S300) predict the duration and fundamental frequency corresponding to each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer;
S400) combine the phonemes, durations and fundamental frequencies to obtain the target song audio.
Further, in the above method of the application, the step S100 further includes the following sub-steps:
S101) obtain a song generation page, the song generation page being used to set the lyrics text and the singer;
S102) obtain the lyrics text from the song generation page;
S103) determine the singer selected on the song generation page.
Further, in the above method of the application, the step S200 further includes the following sub-steps:
S201) construct a sample set from a standard phoneme dictionary and train a morpheme-to-phoneme conversion model, wherein the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-to-phoneme conversion model is a neural network model;
S202) convert the morphemes in the lyrics text into the phonemes using the morpheme-to-phoneme conversion model.
Further, in the above method of the application, the step S300 further includes the following sub-steps:
S301) obtain audio file samples labeled with text samples from the audio file sample set corresponding to the singer;
S302) extract first phoneme samples from the text samples, and segment the audio file samples with a segmentation model to obtain the audio clip corresponding to each first phoneme sample;
S303) obtain the duration and fundamental frequency corresponding to each first phoneme sample from the audio clips, so as to construct a training sample set;
S304) train the phoneme prediction model on the training sample set;
S305) predict the duration and fundamental frequency corresponding to each phoneme in the lyrics text with the trained phoneme prediction model.
Further, in the above method of the application, the audio clip of each first phoneme sample at least includes its start time in the audio file sample.
Further, in the above method of the application, the step S400 further includes the following sub-steps:
S401) obtain a speech synthesis model, the speech synthesis model being a neural network model trained on a speech synthesis sample set, wherein each speech synthesis sample includes a second phoneme sample carrying duration and fundamental frequency information and a corresponding voice clip;
S402) feed the phonemes, durations and fundamental frequencies into the speech synthesis model, which after processing outputs the target song audio corresponding to the lyrics text.
Further, the above method of the application further includes the steps of:
S500) obtain an accompaniment audio;
S600) synthesize the accompaniment audio with the target song audio to obtain a new target song audio.
Further, in the above method of the application, the phoneme prediction model is built on the TensorFlow and Keras learning frameworks.
Secondly, disclosed herein as well is a kind of songs neural network based to generate system, it is suitable for automatic according to the lyrics Generate song.The system comprises the following modules: import modul, for obtaining lyrics text and determining singer;Phoneme extracts mould Block, for extracting phoneme from the lyrics text;Phoneme prediction module, for predicting each sound according to phoneme prediction model Element corresponding duration and fundamental frequency, wherein the phoneme prediction model is according to the corresponding audio file sample of the singer The neural network model that the training of this set obtains;Binding modules are obtained for combining the phoneme, duration and fundamental frequency Target song audio.
Further, in the above system of the application, the import module further includes the following sub-modules: a page acquisition module, for obtaining a song generation page, the song generation page being used to set the lyrics text and the singer; a lyrics acquisition module, for obtaining the lyrics text from the song generation page; and a singer determination module, for determining the singer selected on the song generation page.
Further, in the above system of the application, the phoneme extraction module further includes the following sub-modules: a morpheme-to-phoneme conversion model training module, for constructing a sample set from a standard phoneme dictionary and training a morpheme-to-phoneme conversion model, wherein the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-to-phoneme conversion model is a neural network model; and a phoneme conversion module, for converting the morphemes in the lyrics text into the phonemes using the morpheme-to-phoneme conversion model.
Further, in the above system of the application, the phoneme prediction module further includes the following sub-modules: an audio file sample acquisition module, for obtaining audio file samples labeled with text samples from the audio file sample set corresponding to the singer; an audio clip acquisition module, for extracting first phoneme samples from the text samples and segmenting the audio file samples with a segmentation model to obtain the audio clip corresponding to each first phoneme sample; a training sample set acquisition module, for obtaining the duration and fundamental frequency corresponding to each first phoneme sample from the audio clips, so as to construct a training sample set; a phoneme prediction model training module, for training the phoneme prediction model on the training sample set; and a duration and pitch prediction module, for predicting the duration and fundamental frequency corresponding to each phoneme in the lyrics text with the trained phoneme prediction model.
Further, in the above system of the application, the audio clip of each first phoneme sample at least includes its start time in the audio file sample.
Further, in the above system of the application, the combination module further includes the following sub-modules: a speech synthesis model acquisition module, for obtaining a speech synthesis model, the speech synthesis model being a neural network model trained on a speech synthesis sample set, wherein each speech synthesis sample includes a second phoneme sample carrying duration and fundamental frequency information and a corresponding voice clip; and a target song audio output module, for feeding the phonemes, durations and fundamental frequencies into the speech synthesis model, which after processing outputs the target song audio corresponding to the lyrics text.
Further, the above system of the application further includes:
an accompaniment audio acquisition module, for obtaining an accompaniment audio;
a synthesis module, for synthesizing the accompaniment audio with the target song audio to obtain a new target song audio.
Further, in the above system of the application, the phoneme prediction model is built on the TensorFlow and Keras learning frameworks.
Finally, the application also proposes a computer-readable storage medium storing computer instructions. When the instructions are executed by a processor, the following steps are performed:
S100) obtain a lyrics text and determine a singer;
S200) extract phonemes from the lyrics text;
S300) predict the duration and fundamental frequency corresponding to each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer;
S400) combine the phonemes, durations and fundamental frequencies to obtain the target song audio.
Further, when the processor executes the above instructions, the step S100 further includes the following sub-steps:
S101) obtain a song generation page, the song generation page being used to set the lyrics text and the singer;
S102) obtain the lyrics text from the song generation page;
S103) determine the singer selected on the song generation page.
Further, when the processor executes the above instructions, the step S200 further includes the following sub-steps:
S201) construct a sample set from a standard phoneme dictionary and train a morpheme-to-phoneme conversion model, wherein the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-to-phoneme conversion model is a neural network model;
S202) convert the morphemes in the lyrics text into the phonemes using the morpheme-to-phoneme conversion model.
Further, when the processor executes the above instructions, the step S300 further includes the following sub-steps:
S301) obtain audio file samples labeled with text samples from the audio file sample set corresponding to the singer;
S302) extract first phoneme samples from the text samples, and segment the audio file samples with a segmentation model to obtain the audio clip corresponding to each first phoneme sample;
S303) obtain the duration and fundamental frequency corresponding to each first phoneme sample from the audio clips, so as to construct a training sample set;
S304) train the phoneme prediction model on the training sample set;
S305) predict the duration and fundamental frequency corresponding to each phoneme in the lyrics text with the trained phoneme prediction model.
Further, when the processor executes the above instructions, the audio clip of each first phoneme sample at least includes its start time in the audio file sample.
Further, when the processor executes the above instructions, the step S400 further includes the following sub-steps:
S401) obtain a speech synthesis model, the speech synthesis model being a neural network model trained on a speech synthesis sample set, wherein each speech synthesis sample includes a second phoneme sample carrying duration and fundamental frequency information and a corresponding voice clip;
S402) feed the phonemes, durations and fundamental frequencies into the speech synthesis model, which after processing outputs the target song audio corresponding to the lyrics text.
Further, when the processor executes the above instructions, the following steps are also performed:
S500) obtain an accompaniment audio;
S600) synthesize the accompaniment audio with the target song audio to obtain a new target song audio.
Further, when the processor executes the above instructions, the phoneme prediction model is built on the TensorFlow and Keras learning frameworks.
The beneficial effects of the application are as follows: a neural network predicts the duration and fundamental frequency of the phonemes extracted from the lyrics text, and the phonemes, durations and fundamental frequencies are then combined to obtain the target song audio, so that game development and film and television companies no longer need to bear singers' high performance fees, which reduces song production cost and shortens the song production cycle.
Detailed description of the invention
Fig. 1 shows the flowchart of the neural-network-based song generation method disclosed in the application;
Fig. 2 shows the flowchart of the lyrics text and singer determination sub-method in one embodiment of the application;
Fig. 3 shows the flowchart of the phoneme extraction sub-method in another embodiment of the application;
Fig. 4 shows the flowchart of the phoneme prediction sub-method in another embodiment of the application;
Fig. 5 shows the flowchart of the target song audio generation sub-method in another embodiment of the application;
Fig. 6 shows the flowchart of another embodiment of the neural-network-based song generation method disclosed in the application;
Fig. 7 shows the structure diagram of the neural-network-based song generation system disclosed in the application.
Specific embodiment
The design, specific structure and technical effects of the application are described clearly and completely below with reference to the embodiments and the drawings, so that the purpose, solution and effects of the application may be fully understood. It should be noted that, where they do not conflict, the embodiments of the application and the features in those embodiments may be combined with each other.
It should be noted that, unless otherwise specified, when a feature is said to be "fixed" or "connected" to another feature, it may be directly fixed or connected to that feature, or indirectly fixed or connected to it. In addition, descriptions such as up, down, left and right used in the application only describe the mutual positions of the components of the application in the drawings. The singular forms "a", "an", "the" and "said" used in the application and the appended claims are also intended to include the plural forms, unless the context clearly indicates otherwise.
In addition, unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the art. The terms used in the description are intended merely to describe specific embodiments, not to limit the application. The term "and/or" as used herein includes any combination of one or more of the relevant listed items.
It should be understood that, although the terms first, second, third, etc. may be used in the application to describe various elements, these elements should not be limited by these terms, which are only used to distinguish elements of the same type from one another. For example, without departing from the scope of the application, a first element could also be called a second element and, similarly, a second element could also be called a first element. Depending on the context, the word "if" as used herein may be interpreted as "when" or "while".
Referring to the method flowchart shown in Fig. 1, in one or more embodiments the application proposes a neural-network-based song generation method, suitable for automatically generating a song from lyrics. The method includes the following steps:
S100) obtain a lyrics text and determine a singer;
S200) extract phonemes from the lyrics text;
S300) predict the duration and fundamental frequency corresponding to each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer;
S400) combine the phonemes, durations and fundamental frequencies to obtain the target song audio.
Specifically, when creating a song, a game development or film and television company only needs to input a lyrics text and select a singer, and the neural-network-based song generation method provided by the application automatically generates the target song audio. Further, a phoneme is the smallest unit of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes fall into two broad classes, vowels and consonants. Taking Chinese syllables as examples, ā (啊) has only one phoneme, ài (爱, love) has two phonemes, dāi (呆, slow-witted) has three phonemes, and so on. Extracting phonemes from the lyrics text determines the speech corresponding to that text. However, since different sound sources (people or instruments) pronounce individual phonemes and phoneme combinations slightly differently, having the determined singer sing the lyrics text requires further analyzing the extracted phonemes with that singer's pre-trained phoneme prediction model, which determines the duration and fundamental frequency corresponding to each phoneme. Since combining multiple phonemes produces variations in, for example, pitch and timbre, once the phoneme prediction model has determined the duration and fundamental frequency of each phoneme, the pitch, timbre, etc. of each phoneme are determined; the phonemes, durations and fundamental frequencies are then combined to obtain the target song audio, achieving the effect of simulating the singer singing the lyrics text with the deep learning methods of neural networks. Further, the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer. The samples in the audio file sample set are performance recordings of that singer, so training the phoneme prediction model on this sample set lets the model better represent the singer's performance characteristics.
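By way of illustration, the following is a minimal, self-contained Python skeleton of the S100-S400 flow just described. Every name in it is an illustrative assumption rather than an identifier from the patent, and the three stage functions are trivial stand-ins for the trained models.

```python
# Minimal sketch of the S100-S400 pipeline; all names are assumptions.

def extract_phonemes(lyrics_text: str) -> list[str]:
    """Stand-in for S200: one pseudo-phoneme per whitespace token."""
    return lyrics_text.split()

def predict_duration_f0(phonemes: list[str], singer: str) -> list[tuple[float, float]]:
    """Stand-in for S300: a real system queries the singer's trained model."""
    return [(0.1, 220.0) for _ in phonemes]

def synthesize(phonemes: list[str], duration_f0: list[tuple[float, float]]) -> list[float]:
    """Stand-in for S400: emit silence of the predicted total length at 16 kHz."""
    total_seconds = sum(duration for duration, _ in duration_f0)
    return [0.0] * int(total_seconds * 16000)

def generate_song(lyrics_text: str, singer: str) -> list[float]:
    phonemes = extract_phonemes(lyrics_text)             # S200
    duration_f0 = predict_duration_f0(phonemes, singer)  # S300
    return synthesize(phonemes, duration_f0)             # S400

audio = generate_song("la la la", "singer_a")
print(len(audio) / 16000, "seconds")  # 0.3 seconds
```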
For the determination of the lyrics text and singer above, referring to the sub-method flowchart shown in Fig. 2, in one or more embodiments of the application this can be realized by the following sub-steps:
S101) obtain a song generation page, the song generation page being used to set the lyrics text and the singer;
S102) obtain the lyrics text from the song generation page;
S103) determine the singer selected on the song generation page.
In the above sub-method, the song generation page can be provided by a production website or a corresponding client, and the user can import or input the lyrics text through the song generation page. The song generation page can also offer existing singers for the user to choose from, each of these singers having a correspondingly stored phoneme prediction model. The song generation page can further provide a button that, when clicked, triggers the song generation steps.
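A minimal sketch of such a song generation page follows, under stated assumptions: Flask is assumed as the web framework (the patent only requires a production website or client), and the singer list stands in for the singers that have stored phoneme prediction models.

```python
# Assumed Flask front end for steps S101-S103; not from the patent.
from flask import Flask, request

app = Flask(__name__)
SINGERS = ["singer_a", "singer_b"]  # assumed: singers with stored models

PAGE = """
<form method="post">
  <textarea name="lyrics" placeholder="lyrics text"></textarea>
  <select name="singer">{options}</select>
  <button type="submit">Generate song</button>
</form>
"""

@app.route("/", methods=["GET", "POST"])
def song_generation_page():
    if request.method == "POST":
        lyrics = request.form["lyrics"]   # S102: obtain the lyrics text
        singer = request.form["singer"]   # S103: determine the selected singer
        return f"generating for {singer}: {lyrics[:40]}..."
    options = "".join(f"<option>{s}</option>" for s in SINGERS)
    return PAGE.format(options=options)

if __name__ == "__main__":
    app.run()
```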
Further, referring to the sub-method flowchart shown in Fig. 3, in the above one or more embodiments of the application, the step S200 further includes the following sub-steps:
S201) construct a sample set from a standard phoneme dictionary and train a morpheme-to-phoneme conversion model, wherein the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-to-phoneme conversion model is a neural network model;
S202) convert the morphemes in the lyrics text into the phonemes using the morpheme-to-phoneme conversion model.
Here, a morpheme is the smallest unit coupling sound and meaning, that is, the smallest meaningful linguistic unit. A morpheme is not an independently used linguistic unit; it mainly serves as material for constructing words. Calling it a coupling of sound and meaning, a meaningful linguistic unit, distinguishes it from the syllable: some syllables carry sound but no meaning and are not regarded as morphemes, as in 霹雳 (pīlì, "thunderbolt") and 馄饨 (húntun, "wonton"), whose individual syllables mean nothing on their own. Calling it the smallest meaningful linguistic unit, not an independently used one, distinguishes it from the word. By word formation, morphemes come in three kinds:
monosyllabic morphemes: words built from a single meaningful character;
disyllabic morphemes: words built from two characters that are meaningful only together;
polysyllabic morphemes: words built from more than two characters that are meaningful only together.
Further, the standard phoneme dictionary stores key-value pairs of phonemes and morphemes, from which the sample set for the morpheme-to-phoneme conversion model can be extracted. Training on this sample set yields the morpheme-to-phoneme conversion model, which converts the morphemes in the lyrics text into phonemes.
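The following sketch illustrates, under assumed data formats, how the (morpheme, phonemes) key-value sample set of step S201 might be assembled from a dictionary and applied in step S202. The CMUdict-style line format is an assumption; and where the patent trains a neural conversion model, the dictionary lookup here is only a baseline showing the data flow (out-of-vocabulary morphemes would go to the model).

```python
# Assumed dictionary format: "MORPHEME  PH1 PH2 ...", one entry per line.
DICT_LINES = """\
HELLO  HH AH L OW
WORLD  W ER L D
"""

def build_sample_set(dict_text: str) -> dict[str, list[str]]:
    """Parse dictionary lines into morpheme -> phoneme-list key-value pairs."""
    samples = {}
    for line in dict_text.splitlines():
        morpheme, *phones = line.split()
        samples[morpheme] = phones
    return samples

def lyrics_to_phonemes(lyrics: str, samples: dict[str, list[str]]) -> list[str]:
    """Convert each morpheme in the lyrics text into its phoneme sequence."""
    phonemes = []
    for morpheme in lyrics.upper().split():
        phonemes.extend(samples.get(morpheme, []))  # OOV: defer to the model
    return phonemes

print(lyrics_to_phonemes("hello world", build_sample_set(DICT_LINES)))
# ['HH', 'AH', 'L', 'OW', 'W', 'ER', 'L', 'D']
```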
Further, referring to the sub-method flowchart shown in Fig. 4, in the above one or more embodiments of the application, the step S300 further includes the following sub-steps:
S301) obtain audio file samples labeled with text samples from the audio file sample set corresponding to the singer;
S302) extract first phoneme samples from the text samples, and segment the audio file samples with a segmentation model to obtain the audio clip corresponding to each first phoneme sample;
S303) obtain the duration and fundamental frequency corresponding to each first phoneme sample from the audio clips, so as to construct a training sample set;
S304) train the phoneme prediction model on the training sample set;
S305) predict the duration and fundamental frequency corresponding to each phoneme in the lyrics text with the trained phoneme prediction model.
Specifically, performance recordings of the singer are collected as audio file samples, and the sung content of each audio file sample is labeled with a text sample; that is, the audio file sample is a performance recording with lyrics. The audio file sample can be regarded as a set of first phoneme samples. Further, the segmentation model is also a neural network model; it matches the context in which each phoneme is voiced, yielding the corresponding audio segment and its voicing position within the audio. Specifically, the audio file sample can be split according to the first phoneme samples to obtain the audio clip corresponding to each first phoneme sample. In one or more embodiments of the application, the audio clip of each first phoneme sample at least includes its start time in the audio file sample. Further, the duration and fundamental frequency corresponding to each first phoneme sample can be obtained from its audio clip, so as to construct a training sample set, train the phoneme prediction model, and further predict the duration and fundamental frequency corresponding to each phoneme in the lyrics text with the trained phoneme prediction model.
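As an illustration of step S303, a minimal sketch follows under stated assumptions: each phoneme's audio clip has already been cut out by the segmentation model, the duration is taken from the clip's sample count, and the fundamental frequency is estimated with librosa's pYIN pitch tracker (librosa is an assumption; the patent does not name a feature-extraction library).

```python
# Assumed feature extraction for one segmented phoneme clip (S303).
import numpy as np
import librosa

def clip_duration_and_f0(clip: np.ndarray, sr: int) -> tuple[float, float]:
    """Return (duration in seconds, median F0 in Hz) for one phoneme clip."""
    duration = len(clip) / sr
    f0, voiced, _ = librosa.pyin(
        clip, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
    )
    f0_hz = float(np.nanmedian(f0[voiced])) if np.any(voiced) else 0.0
    return duration, f0_hz

# Demo on a synthetic 220 Hz "phoneme" clip; a real clip would come from S302.
sr = 22050
t = np.arange(int(0.2 * sr)) / sr
clip = np.sin(2 * np.pi * 220.0 * t).astype(np.float32)
print(clip_duration_and_f0(clip, sr))  # roughly (0.2, 220.0)
```

Each training sample then pairs a first phoneme sample with its measured duration and fundamental frequency, e.g. ("HH", 0.05, 140.0).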
Further, referring to the sub-method flowchart shown in Fig. 5, in the above one or more embodiments of the application, the step S400 further includes the following sub-steps:
S401) obtain a speech synthesis model, the speech synthesis model being a neural network model trained on a speech synthesis sample set, wherein each speech synthesis sample includes a second phoneme sample carrying duration and fundamental frequency information and a corresponding voice clip;
S402) feed the phonemes, durations and fundamental frequencies into the speech synthesis model, which after processing outputs the target song audio corresponding to the lyrics text.
Specifically, during training, the input of the speech synthesis model is second phoneme samples carrying duration and fundamental frequency information, of the form [(HH, 0.05s, 140Hz), (EH, 0.07s, 141Hz), ...], and the label is the voice clip corresponding to each second phoneme sample. The trained speech synthesis model can then process the input phonemes, durations and fundamental frequencies and output the target song audio corresponding to the lyrics text.
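A minimal sketch of how input of that form could be turned into a numeric feature matrix for such a model follows; the phoneme inventory and the one-hot encoding are assumptions, since the patent does not fix an encoding.

```python
# Assumed numeric encoding of [(phoneme, duration_s, f0_hz), ...] inputs.
import numpy as np

PHONEMES = ["HH", "EH", "L", "OW"]    # assumed (truncated) inventory
INDEX = {p: i for i, p in enumerate(PHONEMES)}

def encode(sequence: list[tuple[str, float, float]]) -> np.ndarray:
    """One-hot phoneme id concatenated with duration (s) and F0 (Hz)."""
    rows = []
    for phoneme, duration, f0 in sequence:
        one_hot = np.zeros(len(PHONEMES))
        one_hot[INDEX[phoneme]] = 1.0
        rows.append(np.concatenate([one_hot, [duration, f0]]))
    return np.stack(rows)             # shape: (T, len(PHONEMES) + 2)

features = encode([("HH", 0.05, 140.0), ("EH", 0.07, 141.0)])
print(features.shape)  # (2, 6)
```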
Further, referring to the method flowchart shown in Fig. 6, the above one or more embodiments of the application further include the steps of:
S500) obtain an accompaniment audio;
S600) synthesize the accompaniment audio with the target song audio to obtain a new target song audio.
Specifically, the synthesis can be performed by mixing the accompaniment audio with the target song audio. Mixing is the processing of multi-track sound material formed by recording, sampling, synthesis and similar means: the multi-track material is balanced, adjusted and mixed down into a multi-channel finished product. After synthesis, the new target song audio carries the accompaniment.
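A minimal sketch of step S600 under stated assumptions follows: both tracks are float arrays at the same sample rate and already aligned, and the gains stand in for the balancing and adjustment described above. A real mixing stage would also handle alignment and per-track processing.

```python
# Assumed two-track mixdown for S600; gains and normalization are illustrative.
import numpy as np

def mix(vocal: np.ndarray, accompaniment: np.ndarray,
        vocal_gain: float = 1.0, acc_gain: float = 0.6) -> np.ndarray:
    """Sum the two tracks after balancing, then normalize to avoid clipping."""
    n = min(len(vocal), len(accompaniment))
    mixed = vocal_gain * vocal[:n] + acc_gain * accompaniment[:n]
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed

# Demo on two synthetic one-second tones at 16 kHz.
sr = 16000
t = np.arange(sr) / sr
song = mix(np.sin(2 * np.pi * 440 * t), np.sin(2 * np.pi * 220 * t))
print(song.shape, float(np.max(np.abs(song))))  # (16000,) 1.0
```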
In one or more embodiments of the application, the phoneme prediction model is built on the TensorFlow and Keras learning frameworks. Specifically, machine training is performed on the audio file sample set through the machine learning and deep learning computations of TensorFlow, combined with a convolutional neural network and confidence function built with the Keras API, so as to realize the deep learning that predicts the duration and fundamental frequency corresponding to each phoneme. The morpheme-to-phoneme conversion model, segmentation model and speech synthesis model referred to in the application may, of course, also be built on the TensorFlow and Keras learning frameworks. Those skilled in the art can build and train the corresponding classifiers with existing neural network training approaches, and the application places no specific restriction on this.
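A minimal sketch of such a TensorFlow/Keras phoneme prediction model follows: a small convolutional network mapping a window of one-hot phoneme features to the two regression targets, duration in seconds and fundamental frequency in Hz. The layer sizes, inventory size and context window are illustrative assumptions, not values taken from the patent.

```python
# Assumed TensorFlow/Keras duration-and-F0 regressor; hyperparameters illustrative.
import tensorflow as tf

NUM_PHONEMES = 50   # assumed phoneme inventory size
WINDOW = 5          # assumed phoneme context window

model = tf.keras.Sequential([
    tf.keras.Input(shape=(WINDOW, NUM_PHONEMES)),
    tf.keras.layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.Conv1D(64, kernel_size=3, padding="same", activation="relu"),
    tf.keras.layers.GlobalAveragePooling1D(),
    tf.keras.layers.Dense(2),  # outputs: [duration, fundamental frequency]
])
model.compile(optimizer="adam", loss="mse")
model.summary()
```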
Referring to the structure diagram shown in Fig. 7, in one or more embodiments the application also discloses a neural-network-based song generation system, suitable for automatically generating a song from lyrics. The system comprises the following modules: an import module, for obtaining a lyrics text and determining a singer; a phoneme extraction module, for extracting phonemes from the lyrics text; a phoneme prediction module, for predicting the duration and fundamental frequency corresponding to each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer; and a combination module, for combining the phonemes, durations and fundamental frequencies to obtain the target song audio. Specifically, when creating a song, a game development or film and television company only needs to input a lyrics text and select a singer, and the system provided by the application automatically generates the target song audio. Further, a phoneme is the smallest unit of speech; analyzed by the articulatory actions within a syllable, one action constitutes one phoneme. Phonemes fall into two broad classes, vowels and consonants. Taking Chinese syllables as examples, ā (啊) has only one phoneme, ài (爱, love) has two phonemes, dāi (呆, slow-witted) has three phonemes, and so on. Extracting phonemes from the lyrics text determines the speech corresponding to that text. However, since different sound sources (people or instruments) pronounce individual phonemes and phoneme combinations slightly differently, having the determined singer sing the lyrics text requires further analyzing the extracted phonemes with that singer's pre-trained phoneme prediction model, which determines the duration and fundamental frequency corresponding to each phoneme. Since combining multiple phonemes produces variations in, for example, pitch and timbre, once the phoneme prediction model has determined the duration and fundamental frequency of each phoneme, the pitch, timbre, etc. of each phoneme are determined; the phonemes, durations and fundamental frequencies are then combined to obtain the target song audio, achieving the effect of simulating the singer singing the lyrics text with the deep learning methods of neural networks. Further, the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer. The samples in the audio file sample set are performance recordings of that singer, so training the phoneme prediction model on this sample set lets the model better represent the singer's performance characteristics.
Further, in the above one or more embodiments of the application, the import module further includes the following sub-modules: a page acquisition module, for obtaining a song generation page, the song generation page being used to set the lyrics text and the singer; a lyrics acquisition module, for obtaining the lyrics text from the song generation page; and a singer determination module, for determining the singer selected on the song generation page. Specifically, the song generation page can be provided by a production website or a corresponding client, and the user can import or input the lyrics text through the song generation page. The song generation page can also offer existing singers for the user to choose from, each of these singers having a correspondingly stored phoneme prediction model. The song generation page can further provide a button that, when clicked, triggers the song generation steps.
Further, in the above one or more embodiments of the application, the phoneme extraction module further includes the following sub-modules: a morpheme-to-phoneme conversion model training module, for constructing a sample set from a standard phoneme dictionary and training a morpheme-to-phoneme conversion model, wherein the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-to-phoneme conversion model is a neural network model; and a phoneme conversion module, for converting the morphemes in the lyrics text into the phonemes using the morpheme-to-phoneme conversion model. Here, a morpheme is the smallest unit coupling sound and meaning, that is, the smallest meaningful linguistic unit. A morpheme is not an independently used linguistic unit; it mainly serves as material for constructing words. Calling it a coupling of sound and meaning, a meaningful linguistic unit, distinguishes it from the syllable: some syllables carry sound but no meaning and are not regarded as morphemes, as in 霹雳 (pīlì, "thunderbolt") and 馄饨 (húntun, "wonton"). Calling it the smallest meaningful linguistic unit, not an independently used one, distinguishes it from the word. By word formation, morphemes come in three kinds: monosyllabic morphemes, words built from a single meaningful character; disyllabic morphemes, words built from two characters that are meaningful only together; and polysyllabic morphemes, words built from more than two characters that are meaningful only together. Further, the standard phoneme dictionary stores key-value pairs of phonemes and morphemes, from which the sample set for the morpheme-to-phoneme conversion model can be extracted; training on this sample set yields the morpheme-to-phoneme conversion model, which converts the morphemes in the lyrics text into phonemes.
Further, in the above one or more embodiments of the application, the phoneme prediction module further includes the following sub-modules: an audio file sample acquisition module, for obtaining audio file samples labeled with text samples from the audio file sample set corresponding to the singer; an audio clip acquisition module, for extracting first phoneme samples from the text samples and segmenting the audio file samples with a segmentation model to obtain the audio clip corresponding to each first phoneme sample; a training sample set acquisition module, for obtaining the duration and fundamental frequency corresponding to each first phoneme sample from the audio clips, so as to construct a training sample set; a phoneme prediction model training module, for training the phoneme prediction model on the training sample set; and a duration and pitch prediction module, for predicting the duration and fundamental frequency corresponding to each phoneme in the lyrics text with the trained phoneme prediction model. Specifically, performance recordings of the singer are collected as audio file samples, and the sung content of each audio file sample is labeled with a text sample; that is, the audio file sample is a performance recording with lyrics. The audio file sample can be regarded as a set of first phoneme samples. Further, the segmentation model is also a neural network model; it matches the context in which each phoneme is voiced, yielding the corresponding audio segment and its voicing position within the audio. Specifically, the audio file sample can be split according to the first phoneme samples to obtain the audio clip corresponding to each first phoneme sample. In one or more embodiments of the application, the audio clip of each first phoneme sample at least includes its start time in the audio file sample. Further, the duration and fundamental frequency corresponding to each first phoneme sample can be obtained from its audio clip, so as to construct a training sample set, train the phoneme prediction model, and further predict the duration and fundamental frequency corresponding to each phoneme in the lyrics text with the trained phoneme prediction model.
Further, in the above one or more embodiments of the application, the combination module further includes the following sub-modules: a speech synthesis model acquisition module, for obtaining a speech synthesis model, the speech synthesis model being a neural network model trained on a speech synthesis sample set, wherein each speech synthesis sample includes a second phoneme sample carrying duration and fundamental frequency information and a corresponding voice clip; and a target song audio output module, for feeding the phonemes, durations and fundamental frequencies into the speech synthesis model, which after processing outputs the target song audio corresponding to the lyrics text. Specifically, during training, the input of the speech synthesis model is second phoneme samples carrying duration and fundamental frequency information, of the form [(HH, 0.05s, 140Hz), (EH, 0.07s, 141Hz), ...], and the label is the voice clip corresponding to each second phoneme sample. The trained speech synthesis model can then process the input phonemes, durations and fundamental frequencies and output the target song audio corresponding to the lyrics text.
Further, the above one or more embodiments of the application further include:
an accompaniment audio acquisition module, for obtaining an accompaniment audio;
a synthesis module, for synthesizing the accompaniment audio with the target song audio to obtain a new target song audio. Specifically, the synthesis can be performed by mixing the accompaniment audio with the target song audio. Mixing is the processing of multi-track sound material formed by recording, sampling, synthesis and similar means: the multi-track material is balanced, adjusted and mixed down into a multi-channel finished product. After synthesis, the new target song audio carries the accompaniment.
In one or more embodiments of the application, the phoneme prediction model is built on the TensorFlow and Keras learning frameworks. Specifically, machine training is performed on the audio file sample set through the machine learning and deep learning computations of TensorFlow, combined with a convolutional neural network and confidence function built with the Keras API, so as to realize the deep learning that predicts the duration and fundamental frequency corresponding to each phoneme. The morpheme-to-phoneme conversion model, segmentation model and speech synthesis model referred to in the application may, of course, also be built on the TensorFlow and Keras learning frameworks. Those skilled in the art can build and train the corresponding classifiers with existing neural network training approaches, and the application places no specific restriction on this.
It should be appreciated that embodiments of the application can be effected or carried out by computer hardware, a combination of hardware and software, or by computer instructions stored in non-transitory computer-readable memory. The methods can be implemented using standard programming techniques, including a non-transitory computer-readable storage medium configured with a computer program, where the storage medium so configured causes a computer to operate in a specific and predefined manner, according to the methods and figures described in the particular embodiments. Each program can be implemented in a high-level procedural or object-oriented programming language to communicate with a computer system. However, the programs can be implemented in assembly or machine language, if desired. In any case, the language can be a compiled or interpreted language. Furthermore, the program can be run on an application-specific integrated circuit programmed for that purpose.
Further, the method can be implemented on any type of suitable computing platform to which it is operably coupled, including but not limited to a personal computer, minicomputer, mainframe, workstation, networked or distributed computing environment, or a separate or integrated computer platform, or one in communication with a charged particle tool or other imaging device, etc. Aspects of the application can be implemented as machine-readable code stored on a non-transitory storage medium or device, whether removable or integrated into a computing platform, such as a hard disk, an optically read and/or written storage medium, RAM, or ROM, so that it can be read by a programmable computer; when the storage medium or device is read by the computer, it can be used to configure and operate the computer to perform the processes described herein. Furthermore, the machine-readable code, or portions thereof, can be transmitted over a wired or wireless network. When such media contain instructions or programs that, in conjunction with a microprocessor or other data processor, implement the steps described above, the application described herein includes these and other different types of non-transitory computer-readable storage media. The application also includes the computer itself when programmed according to the methods and techniques described herein.
A computer program can be applied to input data to perform the functions described herein, thereby transforming the input data to generate output data stored in non-volatile memory. The output information can also be applied to one or more output devices such as a display. In a preferred embodiment of the application, the transformed data represent physical and tangible objects, including particular visual depictions of physical and tangible objects produced on the display.
Accordingly, the specification and drawings are to be understood in an illustrative rather than a restrictive sense. However, it will be evident that various modifications and changes can be made to the application without departing from its broader spirit and scope as set forth in the claims.
Other variations are within the spirit of the application. Thus, although the disclosed techniques are susceptible to various modifications and alternative constructions, some illustrated embodiments thereof are shown in the drawings and have been described above in detail. It should be understood, however, that there is no intention to confine the application to the specific form or forms disclosed; on the contrary, the intention is to cover all modifications, alternative constructions and equivalents falling within its spirit and scope as defined by the appended claims.

Claims (10)

1. A neural-network-based song generation method, characterized by comprising the following steps:
S100) obtaining a lyrics text and determining a singer;
S200) extracting phonemes from the lyrics text;
S300) predicting the duration and fundamental frequency corresponding to each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer;
S400) combining the phonemes, durations and fundamental frequencies to obtain the target song audio.
2. The method according to claim 1, characterized in that the step S100 further includes the following sub-steps:
S101) obtaining a song generation page, the song generation page being used to set the lyrics text and the singer;
S102) obtaining the lyrics text from the song generation page;
S103) determining the singer selected on the song generation page.
3. The method according to claim 1, characterized in that the step S200 further includes the following sub-steps:
S201) constructing a sample set from a standard phoneme dictionary and training a morpheme-to-phoneme conversion model, wherein the sample set stores key-value pairs of phonemes and morphemes, and the morpheme-to-phoneme conversion model is a neural network model;
S202) converting the morphemes in the lyrics text into the phonemes using the morpheme-to-phoneme conversion model.
4. The method according to claim 1, characterized in that the step S300 further includes the following sub-steps:
S301) obtaining audio file samples labeled with text samples from the audio file sample set corresponding to the singer;
S302) extracting first phoneme samples from the text samples, and segmenting the audio file samples with a segmentation model to obtain the audio clip corresponding to each first phoneme sample;
S303) obtaining the duration and fundamental frequency corresponding to each first phoneme sample from the audio clips, so as to construct a training sample set;
S304) training the phoneme prediction model on the training sample set;
S305) predicting the duration and fundamental frequency corresponding to each phoneme in the lyrics text with the trained phoneme prediction model.
5. The method according to claim 4, characterized in that the audio clip of each first phoneme sample at least includes its start time in the audio file sample.
6. The method according to claim 1, characterized in that the step S400 further includes the following sub-steps:
S401) obtaining a speech synthesis model, the speech synthesis model being a neural network model trained on a speech synthesis sample set, wherein each speech synthesis sample includes a second phoneme sample carrying duration and fundamental frequency information and a corresponding voice clip;
S402) feeding the phonemes, durations and fundamental frequencies into the speech synthesis model, which after processing outputs the target song audio corresponding to the lyrics text.
7. The method according to any one of claims 1 to 6, characterized by further including:
S500) obtaining an accompaniment audio;
S600) synthesizing the accompaniment audio with the target song audio to obtain a new target song audio.
8. The method according to claim 1, characterized in that the phoneme prediction model is built on the TensorFlow and Keras learning frameworks.
9. A neural-network-based song generation system, suitable for the production of two-dimensional electronic action games, characterized by comprising the following modules:
an import module, for obtaining a lyrics text and determining a singer;
a phoneme extraction module, for extracting phonemes from the lyrics text;
a phoneme prediction module, for predicting the duration and fundamental frequency corresponding to each phoneme with a phoneme prediction model, wherein the phoneme prediction model is a neural network model trained on the audio file sample set corresponding to the singer;
a combination module, for combining the phonemes, durations and fundamental frequencies to obtain the target song audio.
10. A computer-readable storage medium storing computer instructions, characterized in that, when the instructions are executed by a processor, the steps of the method according to any one of claims 1 to 8 are realized.
CN201811550908.4A 2018-12-18 2018-12-18 Neural-network-based song generation method and system Pending CN109801608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811550908.4A CN109801608A (en) 2018-12-18 2018-12-18 Neural-network-based song generation method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811550908.4A CN109801608A (en) 2018-12-18 2018-12-18 Neural-network-based song generation method and system

Publications (1)

Publication Number Publication Date
CN109801608A true CN109801608A (en) 2019-05-24

Family

ID=66557198

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811550908.4A Pending CN109801608A (en) 2018-12-18 2018-12-18 Neural-network-based song generation method and system

Country Status (1)

Country Link
CN (1) CN109801608A (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9865251B2 (en) * 2015-07-21 2018-01-09 Asustek Computer Inc. Text-to-speech method and multi-lingual speech synthesizer using the method
CN106652995A (en) * 2016-12-31 2017-05-10 深圳市优必选科技有限公司 Voice broadcasting method and system for text
CN108510975A (en) * 2017-02-24 2018-09-07 百度(美国)有限责任公司 System and method for real-time neural text-to-speech
CN106971703A (en) * 2017-03-17 2017-07-21 西北师范大学 An HMM-based song synthesis method and device
CN106898340A (en) * 2017-03-30 2017-06-27 腾讯音乐娱乐(深圳)有限公司 Song synthesis method and terminal
US20180336880A1 (en) * 2017-05-19 2018-11-22 Baidu Usa Llc Systems and methods for multi-speaker neural text-to-speech
CN107833572A (en) * 2017-11-06 2018-03-23 芋头科技(杭州)有限公司 A speech synthesis method and system simulating a user's speech
CN108806665A (en) * 2018-09-12 2018-11-13 百度在线网络技术(北京)有限公司 Speech synthesis method and device

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112802446A (en) * 2019-11-14 2021-05-14 腾讯科技(深圳)有限公司 Audio synthesis method and device, electronic equipment and computer-readable storage medium
CN112802446B (en) * 2019-11-14 2024-05-07 腾讯科技(深圳)有限公司 Audio synthesis method and device, electronic equipment and computer readable storage medium
CN111429877A (en) * 2020-03-03 2020-07-17 云知声智能科技股份有限公司 Song processing method and device
CN111477210A (en) * 2020-04-02 2020-07-31 北京字节跳动网络技术有限公司 Speech synthesis method and device
WO2021218324A1 (en) * 2020-04-27 2021-11-04 北京字节跳动网络技术有限公司 Song synthesis method, device, readable medium, and electronic apparatus
CN112037757A (en) * 2020-09-04 2020-12-04 腾讯音乐娱乐科技(深圳)有限公司 Singing voice synthesis method and device and computer readable storage medium
CN112037757B (en) * 2020-09-04 2024-03-15 腾讯音乐娱乐科技(深圳)有限公司 Singing voice synthesizing method, singing voice synthesizing equipment and computer readable storage medium
CN112071299A (en) * 2020-09-09 2020-12-11 腾讯音乐娱乐科技(深圳)有限公司 Neural network model training method, audio generation method and device and electronic equipment
CN112164387A (en) * 2020-09-22 2021-01-01 腾讯音乐娱乐科技(深圳)有限公司 Audio synthesis method and device, electronic equipment and computer-readable storage medium
CN112331222A (en) * 2020-09-23 2021-02-05 北京捷通华声科技股份有限公司 Method, system, equipment and storage medium for converting song tone
CN112750422A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method, device and equipment
CN112750421A (en) * 2020-12-23 2021-05-04 出门问问(苏州)信息科技有限公司 Singing voice synthesis method and device and readable storage medium
CN112750421B (en) * 2020-12-23 2022-12-30 出门问问(苏州)信息科技有限公司 Singing voice synthesis method and device and readable storage medium
CN112750422B (en) * 2020-12-23 2023-01-31 出门问问创新科技有限公司 Singing voice synthesis method, device and equipment
CN112698757A (en) * 2020-12-25 2021-04-23 北京小米移动软件有限公司 Interface interaction method and device, terminal equipment and storage medium
CN112786013A (en) * 2021-01-11 2021-05-11 北京有竹居网络技术有限公司 Voice synthesis method and device based on album, readable medium and electronic equipment
CN112906369A (en) * 2021-02-19 2021-06-04 脸萌有限公司 Lyric file generation method and device

Similar Documents

Publication Publication Date Title
CN109801608A (en) Neural-network-based song generation method and system
CN112863483B (en) Voice synthesizer supporting multi-speaker style and language switching and controllable rhythm
US10891928B2 (en) Automatic song generation
Sóskuthy Evaluating generalised additive mixed modelling strategies for dynamic speech analysis
Bigi SPPAS-multi-lingual approaches to the automatic annotation of speech
CN108806655A Automatic song generation
KR20180063163A (en) Automated music composition and creation machines, systems and processes employing musical experience descriptors based on language and / or graphic icons
CN109817197A (en) Song generation method, device, computer equipment and storage medium
WO2020098269A1 (en) Speech synthesis method and speech synthesis device
CN110148394A (en) Song synthetic method, device, computer equipment and storage medium
Chow et al. A musical approach to speech melody
CN109326280B (en) Singing synthesis method and device and electronic equipment
CN110782918B (en) Speech prosody assessment method and device based on artificial intelligence
CN109829482A (en) Song training data processing method, device and computer readable storage medium
CN110010136A (en) The training and text analyzing method, apparatus, medium and equipment of prosody prediction model
CN110164460A Singing synthesis method and device
CN113593520A (en) Singing voice synthesis method and device, electronic equipment and storage medium
CN111508466A (en) Text processing method, device and equipment and computer readable storage medium
CN112802446A (en) Audio synthesis method and device, electronic equipment and computer-readable storage medium
CN101887719A (en) Speech synthesis method, system and mobile terminal equipment with speech synthesis function
CN116320607A (en) Intelligent video generation method, device, equipment and medium
Ballier et al. Developing corpus interoperability for phonetic investigation of learner corpora
Yamamoto et al. Nnsvs: A neural network-based singing voice synthesis toolkit
CN112242134A (en) Speech synthesis method and device
CN109785818A A deep-learning-based music composition method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20190524)