CN109801618A - Method and device for generating audio information - Google Patents

Method and device for generating audio information

Info

Publication number
CN109801618A
CN109801618A (application CN201711137172.3A)
Authority
CN
China
Prior art keywords
audio information
word
duration prediction
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201711137172.3A
Other languages
Chinese (zh)
Other versions
CN109801618B (en)
Inventor
李廣之
王楠
康世胤
陀得意
朱晓龙
张友谊
林少彬
郑永森
邹子馨
何静
陈在真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Shenzhen Tencent Computer Systems Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Tencent Computer Systems Co Ltd
Priority to CN201711137172.3A
Publication of CN109801618A
Application granted
Publication of CN109801618B
Active legal status
Anticipated expiration

Abstract

Embodiments of the present invention disclose a method and device for generating audio information, so that fused audio in which the speech matches the rhythm can be generated from input text. An embodiment of the present invention provides a method for generating audio information, the method comprising: obtaining text information and first audio information, the text information comprising at least one word; performing linguistic analysis on the text information to obtain a linguistic feature of each of the at least one word; performing phoneme-level duration prediction and adaptive duration adjustment on each of the at least one word using a duration prediction model, to obtain a predicted phoneme duration of each of the at least one word; generating, according to the predicted phoneme duration of the at least one word and the corresponding linguistic feature, second audio information corresponding to the at least one word; and synthesizing the first audio information and the second audio information to obtain fused audio information.

Description

Method and device for generating audio information
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for generating audio information.
Background technique
Music plays an irreplaceable role in people's lives, and music can be divided into many types according to its rhythm. Among them, hip-hop music (rap or hip hop) is a musical style in which lyrics are chanted rhythmically over an accompaniment, and its accompaniments are often produced by music sampling. At present, audio information is mainly generated through manual creation; for example, hip-hop music is composed by professional hip-hop singers. People without a musical background, however, do not have the ability to compose music.
To allow music to be created without such a barrier, music that ordinary users can enjoy needs to be generated. Two ways of generating music exist in the prior art: the first converts the sound of a video into music, and the second converts a voice recording made by the user into music. The first way requires processing the video data to extract the voice data carried in the video, and then mixing that sound with background music to generate music for the user to enjoy. The second way does not require processing video data; it only needs to synthesize the voice with background music to generate music for the user to enjoy.
In the above technical solutions for generating music, background music can only be simply mixed with the sound of a video or with a voice recording. This way of generating music does not take into account the audio characteristics of the sound or voice itself, so the generated music cannot match the content input by the user.
Summary of the invention
Embodiments of the present invention provide a method and device for generating audio information, so that audio information in which the speech matches the rhythm can be generated from input text.
In order to solve the above technical problems, embodiments of the present invention provide the following technical solutions:
In a first aspect, an embodiment of the present invention provides a method for generating audio information, the method comprising:
obtaining text information and first audio information, the text information comprising at least one word;
performing linguistic analysis on the text information to obtain a linguistic feature of each of the at least one word;
performing phoneme-level duration prediction and adaptive duration adjustment on each of the at least one word using a duration prediction model, to obtain a predicted phoneme duration of each of the at least one word;
generating, according to the predicted phoneme duration of the at least one word and the corresponding linguistic feature, second audio information corresponding to the at least one word; and
synthesizing the first audio information and the second audio information to obtain fused audio information.
In a second aspect, an embodiment of the present invention further provides a device for generating audio information, the device comprising:
an obtaining module, configured to obtain text information and first audio information, the text information comprising at least one word;
a linguistic analysis module, configured to perform linguistic analysis on the text information to obtain a linguistic feature of each of the at least one word;
a duration prediction module, configured to perform phoneme-level duration prediction and adaptive duration adjustment on each of the at least one word using a duration prediction model, to obtain a predicted phoneme duration of each of the at least one word;
an audio generation module, configured to generate, according to the predicted phoneme duration of the at least one word and the corresponding linguistic feature, second audio information corresponding to the at least one word; and
an audio fusion module, configured to synthesize the first audio information and the second audio information to obtain fused audio information.
In a third aspect, the present application provides a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to perform the method described in the above aspects.
As can be seen from the above technical solutions, the embodiments of the present invention have the following advantages:
In the embodiments of the present invention, text information and first audio information are obtained, linguistic analysis is performed on the text information, and a linguistic feature is obtained for each of the at least one word. Phoneme-level duration prediction and adaptive duration adjustment are performed on each of the at least one word using a duration prediction model, to obtain the predicted phoneme duration of each word. Second audio information corresponding to the at least one word is generated according to the predicted phoneme durations and the corresponding linguistic features, and finally the first audio information and the second audio information are synthesized to obtain fused audio information. In the embodiments of the present invention, once the text information is obtained, linguistic analysis can be performed on it. Because the second audio information is generated from predicted phoneme durations and linguistic features, and the duration prediction model has performed duration prediction and adaptive duration adjustment, the second audio information adapts more easily to the rhythm of the first audio information, so that more rhythmic fused audio information can be formed. The fused audio information ultimately generated in the embodiments of the present invention is closely associated with the obtained text information and first audio information, and fused audio in which the speech matches the rhythm can be generated by automatically processing the text information.
Detailed description of the invention
To describe the technical solutions in the embodiments of the present invention more clearly, the accompanying drawings required for describing the embodiments are briefly introduced below. Apparently, the drawings in the following description show only some embodiments of the present invention, and those skilled in the art may derive other drawings from these drawings.
Fig. 1 is a schematic flow diagram of a method for generating audio information according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a music generation process based on a duration prediction model according to an embodiment of the present invention;
Fig. 3-a is a schematic structural diagram of a device for generating audio information according to an embodiment of the present invention;
Fig. 3-b is a schematic structural diagram of a linguistic analysis module according to an embodiment of the present invention;
Fig. 3-c is a schematic structural diagram of an audio generation module according to an embodiment of the present invention;
Fig. 3-d is a schematic structural diagram of another device for generating audio information according to an embodiment of the present invention;
Fig. 3-e is a schematic structural diagram of another device for generating audio information according to an embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a terminal to which the method for generating audio information according to an embodiment of the present invention is applied.
Detailed description of embodiments
Embodiments of the present invention provide a method and device for generating audio information, so that audio information in which the speech matches the rhythm can be generated from input text.
In order to make the objectives, features, and advantages of the present invention clearer and easier to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the embodiments described below are only some rather than all of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention shall fall within the protection scope of the present invention.
The terms "comprising" and "having" in the specification, the claims, and the accompanying drawings of the present invention, and any variants thereof, are intended to cover non-exclusive inclusion, so that a process, method, system, product, or device comprising a series of units is not necessarily limited to those units, but may include other units that are not expressly listed or that are inherent to such a process, method, product, or device.
Detailed descriptions are provided below.
An embodiment of the method for generating audio information of the present invention can specifically be applied to generating, based on text information, fused audio information that matches the text and has a strong rhythm. Referring to Fig. 1, the method for generating audio information provided by an embodiment of the present invention may include the following steps:
101. Obtain text information and first audio information, the text information comprising at least one word.
In this embodiment of the present invention, a terminal may first obtain text information and first audio information. The text information may be text input to the terminal by a user and may be used for synthesis with the first audio information into fused audio. The text information input to the terminal may be text saved in advance by the user, text obtained by the user while browsing a web page, or text converted from voice input by the user. The first audio information may specifically be background music stored in the terminal, a song track stored in the terminal, radio audio content, or the like, which is not limited herein.
102. Perform linguistic analysis on the text information to obtain a linguistic feature of each of the at least one word.
In this embodiment of the present invention, after the text information and the first audio information are read, linguistic analysis may be performed on the text information: at least one word is segmented out of the text information, and a corresponding linguistic feature is generated for each word. A linguistic feature is a feature that describes the content of a word at the linguistic level. For example, linguistic analysis of the text information stored in the terminal may include lexical, grammatical, and semantic analysis, sentence by sentence, to determine the underlying structure of each sentence and the phoneme composition of each word.
In some embodiments of the present invention, performing linguistic analysis on the text information in step 102 to obtain a linguistic feature of each of the at least one word comprises:
breaking the text information into sentences to obtain at least one sub-text;
segmenting each sub-text into words according to part of speech and prosody, to obtain the word(s) corresponding to each sub-text; and
extracting a linguistic feature from the word(s) corresponding to each sub-text, to obtain the linguistic feature of each of the at least one word.
Here, the terminal may perform sentence breaking on the text information, that is, break a passage of text information into sub-texts of at least one sentence, and then segment each sub-text into words, for example splitting each sub-text into one or more words according to part-of-speech features and prosodic features; finally, a linguistic feature may be extracted for each word. In addition to extracting linguistic features for the at least one word, processing of polyphonic characters may also be performed, so that text analysis of the words provides information for subsequent feature extraction. The main processing steps include pronunciation generation, prosody prediction, and part-of-speech prediction.
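As an illustration of this stage, the following is a minimal sketch in Python, assuming hypothetical helpers segment_words, to_pinyin, and predict_prosody for word segmentation, pronunciation generation, and prosody prediction; the patent does not specify a concrete implementation.

```python
import re

def linguistic_analysis(text):
    # Sentence breaking: split the text information into sub-texts at punctuation.
    sub_texts = [s for s in re.split(r"[，。！？；,.!?;\s]+", text) if s]
    words = []
    for sub in sub_texts:
        # Word segmentation according to part of speech and prosody
        # (segment_words is a hypothetical helper returning (word, pos) pairs).
        for token, pos in segment_words(sub):
            words.append({
                "word": token,
                "pos": pos,                         # part-of-speech prediction
                "pinyin": to_pinyin(token),         # pronunciation generation (hypothetical)
                "prosody": predict_prosody(token),  # prosody prediction (hypothetical)
            })
    return words
```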
103. Perform phoneme-level duration prediction and adaptive duration adjustment on each of the at least one word using a duration prediction model, to obtain a predicted phoneme duration of each of the at least one word.
In this embodiment of the present invention, after the linguistic feature of each of the at least one word is extracted, the linguistic features may be used to predict the phoneme durations of the at least one word. For example, the linguistic features of the at least one word may be input into a duration prediction model, which may be generated from words of known phoneme duration using a neural network algorithm. The pre-generated duration prediction model in this embodiment of the present invention can be used for phoneme-level duration prediction and adaptive duration adjustment. A phoneme is a pronunciation element that constitutes a word; it is the minimum unit of a syllable, the smallest speech segment, and the smallest phonetic unit divided from the perspective of sound quality. For example, the Chinese syllable wén (文) has two phonemes and jiàn (件) has four. In this embodiment of the present invention, a word may include at least one phoneme, and phoneme-level duration prediction means that the duration prediction model predicts the duration of a word with the phoneme as the unit of duration: if a word consists of multiple phonemes, duration prediction for the word yields the sum of the durations of all the phonemes composing it. Because music differs from ordinary speech in having a rhythm, an adaptive adjustment is applied to the result of duration prediction, so that each word can land on the beat while its original pronunciation remains unchanged. In this embodiment of the present invention, the linguistic features of the at least one word are input into the duration prediction model, and the duration prediction model outputs the predicted phoneme duration of each of the at least one word. Because the duration prediction model predicts the phoneme duration of each word and the phoneme durations are adaptively adjusted, every phoneme included in the at least one word segmented from the text has a predicted duration, and the predicted phoneme durations of the at least one word can therefore be used to generate audio information that more easily matches the rhythm.
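A minimal sketch of this step, assuming a trained regression model that maps a word's linguistic feature vector to one duration per phoneme; the function and field names are illustrative, not the patent's actual implementation.

```python
def predict_phoneme_durations(words, duration_model, encode_features, to_phonemes):
    """For each word, predict one duration per phoneme with the duration model."""
    predictions = []
    for w in words:
        feature_vec = encode_features(w)      # linguistic features -> input vector
        phonemes = to_phonemes(w["word"])     # e.g. "文" -> ["w", "en"]
        durations = duration_model(feature_vec, len(phonemes))
        # The predicted duration of the word is the sum of its phoneme durations.
        predictions.append({"word": w["word"], "features": feature_vec,
                            "phonemes": phonemes, "durations": list(durations)})
    return predictions
```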
In some embodiments of the present invention, the duration prediction model may be generated as follows. The method for generating audio information provided by this embodiment of the present invention further includes the following steps:
extracting phoneme durations from training samples in a training corpus;
using the extracted phoneme durations as input parameters of a neural network, training the duration prediction model on phoneme durations;
after the training of the duration prediction model is completed, testing the duration prediction model on phoneme durations using test samples in a test corpus; and
outputting the duration prediction model after the test is completed.
Here, in this embodiment of the present invention, a text corpus may first be obtained to generate the duration prediction model. For example, a training corpus storing training samples is obtained, and the phoneme duration of each word in each training sample is obtained; these phoneme durations serve as ground-truth values for training the duration prediction model, and the model training may be completed, for example, by way of neural network learning. Training the model with the ground-truth phoneme durations mainly means training the model parameters so that the speech can adapt to the rhythm. A test corpus storing test samples may also be provided in this embodiment of the present invention: after training is completed, the duration prediction model is tested on phoneme durations using the test samples in the test corpus, and the duration prediction model is output after it converges and passes the test.
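A minimal training-and-testing sketch under the assumption that phoneme durations have already been extracted from the corpora; PyTorch and the small network shape are illustrative choices, not specified by the patent.

```python
import torch
import torch.nn as nn

def train_duration_model(train_x, train_y, test_x, test_y, epochs=200, lr=1e-3):
    model = nn.Sequential(nn.Linear(train_x.shape[1], 64), nn.ReLU(),
                          nn.Linear(64, 1))      # outputs one phoneme duration
    loss_fn = nn.MSELoss()                       # least-squares error criterion
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(train_x), train_y)  # fit extracted phoneme durations
        loss.backward()                          # back-propagation
        opt.step()
    with torch.no_grad():                        # test on the test corpus
        test_loss = loss_fn(model(test_x), test_y).item()
    return model, test_loss
```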
104. Generate, according to the predicted phoneme duration of the at least one word and the corresponding linguistic feature, second audio information corresponding to the at least one word.
In this embodiment of the present invention, after the predicted phoneme durations of the at least one word are obtained through the foregoing step 103, audio information may be generated based on the predicted phoneme durations and the linguistic features of the corresponding words. To distinguish it from the first audio information obtained in step 101, the audio information generated here from the predicted phoneme durations and linguistic features of the at least one word is defined as the second audio information. For example, with reference to the predicted duration of each phoneme, the linguistic features of the phonemes are converted from text into speech; specifically, text-to-speech (TTS) may be used to convert the predicted phoneme durations and the linguistic features of the words obtained in the foregoing steps into speech.
In some embodiments of the present invention, generating, in step 104, the second audio information corresponding to the at least one word according to the predicted phoneme duration of the at least one word and the corresponding linguistic feature comprises:
performing acoustic feature prediction on each of the at least one word according to the predicted phoneme duration of the at least one word and the corresponding linguistic feature, to obtain an acoustic feature corresponding to each of the at least one word;
converting the acoustic feature corresponding to the at least one word into a sound clip corresponding to the at least one word; and
synthesizing the sound clips corresponding to the at least one word together to obtain the second audio information.
Here, the acoustic feature of each word is predicted from the predicted phoneme duration and the linguistic feature of the word; an acoustic feature may also be called a sound feature. The acoustic feature of the word is then converted into the sound clip corresponding to the word by a speech synthesis tool; for example, individual characters or phrases are extracted in speech synthesis, and the linguistic description is converted into a speech waveform. After each word generates its corresponding sound clip, the sound clips corresponding to all the words are synthesized together to obtain the complete second audio information.
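A sketch of this sub-flow: predict acoustic features per word, vocode each word into a clip, and concatenate the clips. The acoustic_model and vocoder arguments are assumed components of a parametric TTS pipeline, not named in the patent.

```python
import numpy as np

def synthesize_second_audio(word_predictions, acoustic_model, vocoder):
    clips = []
    for wp in word_predictions:
        # Acoustic feature prediction from linguistic features and phoneme durations.
        acoustic = acoustic_model(wp["features"], wp["durations"])
        clips.append(vocoder(acoustic))   # convert acoustic features to a sound clip
    return np.concatenate(clips)          # clips joined into the second audio
```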
105. Synthesize the first audio information and the second audio information to obtain fused audio information.
In this embodiment of the present invention, the text information can be converted into the second audio information through step 104; the second audio information is speech that includes the lyric content corresponding to the text information. The second audio information is then combined with the first audio information to produce the final fused audio information. For example, if the first audio information is background music, the terminal may synthesize the second audio information and the background music together to obtain the fused audio information. In this embodiment of the present invention, the fused audio information is obtained by synthesizing the second audio information converted from the text information with the first audio information, so that when the fused audio information is played, the user can hear rhythmic fused audio with lyrics. For example, the second audio information converted from the text information is synthesized with hip-hop background music to obtain hip-hop music, thereby completing text-to-rap (TTR) processing.
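A minimal mixing sketch for this step, assuming equal sample rates for both signals; the gains are illustrative, not prescribed by the patent.

```python
import numpy as np

def fuse_audio(first_audio, second_audio, music_gain=0.6, speech_gain=1.0):
    """Mix background music (first audio) with generated speech (second audio)."""
    n = max(len(first_audio), len(second_audio))
    mix = np.zeros(n, dtype=np.float32)
    mix[:len(first_audio)] += music_gain * np.asarray(first_audio, dtype=np.float32)
    mix[:len(second_audio)] += speech_gain * np.asarray(second_audio, dtype=np.float32)
    return np.clip(mix, -1.0, 1.0)        # the fused audio information
```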
In some embodiments of the present invention, after the second audio information corresponding to the at least one word is generated in step 104 according to the predicted phoneme duration of the at least one word and the corresponding linguistic feature, this embodiment of the present invention may, in addition to performing the foregoing step 105, further include the following steps:
judging, according to the predicted phoneme durations of the second audio information, whether the rhythm of the second audio information matches that of the first audio information; and
if the rhythms of the second audio information and the first audio information match, triggering execution of the foregoing step 105: synthesizing the first audio information and the second audio information to obtain fused audio information.
Here, in this embodiment of the present invention, a corresponding prosodic feature may be set for the first audio information. Whether the rhythm of the second audio information matches that of the first audio information is judged from the predicted phoneme durations of the second audio information. A prosodic feature is a rhythm-related feature possessed by the audio information itself and may be output by a neural-network-based detection method. For example, after the prosodic feature of the first audio information is detected, the first audio information may be saved in an audio database; when the first audio information is obtained in the foregoing step 101, the prosodic feature corresponding to the first audio information is then available. Only when the rhythms of the second audio information and the first audio information match can it be ensured that every word of the lyrics lands on the beat while its original pronunciation remains unchanged.
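A sketch of one possible matching check: compare the word boundaries implied by the predicted phoneme durations against the beat grid of the first audio. The tolerance and the boundary-to-beat criterion are assumptions; the patent only states that the match is judged from the predicted phoneme durations and the first audio's prosodic features.

```python
def rhythm_matches(word_predictions, beat_times, tolerance=0.05):
    """Check whether each word boundary falls near a beat of the first audio."""
    t = 0.0
    for wp in word_predictions:
        t += sum(wp["durations"])                 # word boundary time in seconds
        if min(abs(t - b) for b in beat_times) > tolerance:
            return False                          # this word would miss the beat
    return True
```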
In some embodiments of the present invention, in addition to the foregoing steps, the method for generating audio information provided by this embodiment of the present invention may further include the following steps:
if the rhythms of the second audio information and the first audio information do not match, performing rhythm matching between the predicted phoneme durations of the second audio information and the audio data in an audio database, to obtain audio data screened out of the audio database, where each piece of audio data in the audio database corresponds to a prosodic feature; and
synthesizing the generated second audio information with the audio data screened out of the audio database, to obtain fused audio information.
Here, an audio database may be provided in this embodiment of the present invention; for example, the audio database may specifically be a background music library. Multiple pieces of audio data are stored in the audio database, each corresponding to a prosodic feature. Rhythm matching is performed between the predicted phoneme durations of the second audio information and the audio data in the audio database, and the audio data screened out of the audio database, whose rhythm matches that of the second audio information, can be used for the final generation of the fused audio information. Because the rhythm of the screened-out audio data matches that of the second audio information, it can be ensured that every word of the lyrics lands on the beat while its original pronunciation remains unchanged.
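A sketch of this fallback path: screen the audio database for the track whose prosodic feature best fits the predicted durations. The scoring function rhythm_distance is an illustrative assumption.

```python
def screen_audio_database(word_predictions, audio_database, rhythm_distance):
    """Pick the database track whose prosodic feature best fits the speech."""
    best_entry, best_score = None, float("inf")
    for entry in audio_database:                  # each entry carries a prosodic feature
        score = rhythm_distance(word_predictions, entry["prosody"])
        if score < best_score:
            best_entry, best_score = entry, score
    return best_entry
```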
As can be seen from the above description of this embodiment of the present invention, text information and first audio information are obtained, linguistic analysis is then performed on the text information, and a linguistic feature is obtained for each of the at least one word, the at least one word being obtained by segmenting the text. Phoneme-level duration prediction and adaptive duration adjustment are performed on each of the at least one word using a duration prediction model, to obtain the predicted phoneme duration of each word. Second audio information corresponding to the at least one word is generated according to the predicted phoneme durations and the corresponding linguistic features, and the second audio information and the first audio information are synthesized to obtain fused audio information. In this embodiment of the present invention, once the text information is obtained, linguistic analysis can be performed on it; because the second audio information generated from the predicted phoneme durations and linguistic features has undergone duration prediction and adaptive duration adjustment by the duration prediction model, it adapts more easily to the rhythm of the first audio information, so that more rhythmic fused audio information can be formed. The fused audio information ultimately generated is closely associated with the obtained text information and first audio information, and fused audio in which the speech matches the rhythm can be generated by automatically processing the text information. An end-to-end sketch of this flow, using the helpers illustrated in the preceding steps, is given below.
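This end-to-end sketch ties together the helper functions sketched above; all component names are illustrative, not the patent's actual implementation.

```python
def text_to_rap(text, first_audio, duration_model, acoustic_model, vocoder,
                encode_features, to_phonemes):
    words = linguistic_analysis(text)                                # step 102
    preds = predict_phoneme_durations(words, duration_model,
                                      encode_features, to_phonemes)  # step 103
    second_audio = synthesize_second_audio(preds, acoustic_model,
                                           vocoder)                  # step 104
    return fuse_audio(first_audio, second_audio)                     # step 105
```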
To facilitate a better understanding and implementation of the above solutions of the embodiments of the present invention, corresponding application scenarios are described below by way of example.
In the embodiments of the present invention, songs can be composed by artificial intelligence (AI), which is a forward-looking attempt and provides a reference for applying AI in larger scenarios later. The generation of hip-hop music is taken as an example below. TTR (Text To Rap) converts text into rap music: for the text information input by a user, after linguistic features are extracted, phoneme-level duration prediction and adaptive duration adjustment can be performed, the text information is converted into speech, background music with a particular rhythm is then added, and the background music and the text speech are seamlessly connected to complete the hip-hop music, finally generating a piece of pleasant music with hip-hop characteristics.
The embodiments of the present invention are mainly based on a passage of text input by a user. The text is segmented into individual words or phrases, phoneme-level duration prediction and adaptive duration adjustment can be performed for each of these words, and the words are finally converted into speech by TTS technology. Fig. 2 is a schematic diagram of the music generation process based on the duration prediction model according to an embodiment of the present invention, which mainly includes the following steps:
Step 1: extract parameters from corpus A.
Corpus A is the training corpus, in which training corpus texts are stored.
Step 2: extract phoneme durations from the texts of corpus A.
Step 3: perform parametric modeling.
Step 4: train the model.
Step 5: generate the duration prediction model.
The training corpus texts in corpus A can be used for training the duration prediction model. The phoneme durations extracted in step 2 are the actual durations of the phonemes included in the words; these actual phoneme durations can be used to train the model parameters, making the rhythm of the speech adaptive and the resulting speech more rhythmic.
The duration prediction model generated in this embodiment of the present invention mainly performs phoneme-level duration prediction, and the rhythm is judged from the durations. Because hip-hop differs from ordinary speech in having a rhythm, an adaptive duration adjustment is applied to the result of duration prediction, so that each word can land on the beat while its original pronunciation remains unchanged.
In this embodiment of the present invention, the duration prediction model may use a loss function (cost function) to judge whether the model has converged. The loss function reflects how well the model fits the data: the worse the fit, the larger the value of the loss function should be. When the loss function is larger, its corresponding gradient is also larger, and in that case the variables can be updated faster. In the present invention, the loss function used is the mean squared error (MSE) criterion:

C = (1/2) · ‖G(x) − Y‖²

where C denotes the loss function, and G is the duration prediction model, which outputs a prediction vector G(x) from an input matrix X, with Y being the true value.
Therefore, it can be seen from this loss function that the larger the Euclidean distance between the predicted value G(x) and the true value Y, the larger the loss, and vice versa. Differentiating with respect to the trainable parameter gives:

∂C/∂w = (G(x) − Y) · ∂G(x)/∂w

where w is a parameter to be trained in the model G. The parameter w is a weight; updating the model mainly means updating the weights, and w is the core parameter of the model.
The following description is given with reference to a neural network model. Taking a back-propagation (BP) neural network as an example, the value propagated backward can be computed as:

Δw = −η · ∂C/∂w

where Δw denotes the value propagated backward through the BP neural network and η denotes the learning rate. At the loss layer of the last layer, the network output G(x) and the true value Y are used to obtain a loss, and the neural network then trains the duration prediction model by minimizing the value of this loss function.
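As a numeric illustration of these formulas (an assumption added here for clarity, not part of the patent), the following toy gradient-descent step uses G(x) = w·x as a stand-in for the duration model:

```python
def sgd_step(w, x, y, lr=0.1):
    pred = w * x                   # G(x): toy linear model
    grad = (pred - y) * x          # dC/dw = (G(x) - Y) * dG(x)/dw
    return w - lr * grad           # weight moved against the gradient by rate lr

w = 0.0
for _ in range(100):
    w = sgd_step(w, x=2.0, y=3.0)  # converges toward w = 1.5, where G(x) = Y
```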
Step 6: extract parameters from corpus B.
Corpus B is the test corpus, in which test corpus texts are stored.
Step 7: extract phoneme durations from the texts of corpus B.
Step 8: perform speech adaptation.
After the duration prediction model is generated, it can predict the phoneme durations of the test texts, so as to obtain optimal phoneme duration prediction results.
Step 9: extract linguistic features from the text.
Here, the user may input a passage of text as the lyrics. Text analysis is then performed on the lyrics to provide information for subsequent feature extraction, mainly including pronunciation generation, prosody prediction, and part-of-speech prediction. Linguistic feature extraction is then carried out: after the text analysis result is obtained, the Chinese language features are extracted from the result and converted into the input vector of the neural network.
Step 10: adjust the phoneme durations according to the linguistic features, to obtain the phoneme duration prediction results.
Here, the duration prediction model generated above can be used to perform phoneme-level duration prediction. Because hip-hop music differs from ordinary speech in having a rhythm, an adaptive duration adjustment is applied to the result of duration prediction, so that each word can land on the beat while its original pronunciation remains unchanged; a sketch of such an adjustment follows.
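A minimal sketch of one way to realize the adaptive adjustment: stretch or compress each word's predicted phoneme durations so the word ends on the next beat of the backing track. The beat grid and the uniform scaling rule are assumptions for illustration.

```python
import math

def snap_to_beats(word_predictions, beat_interval):
    """Stretch/compress each word so it ends on the next beat boundary."""
    t = 0.0
    for wp in word_predictions:
        total = sum(wp["durations"])
        # End of word aligned to the nearest beat boundary at or after t + total.
        target_end = math.ceil((t + total) / beat_interval) * beat_interval
        scale = (target_end - t) / total
        # Scaling all phonemes by the same factor keeps their relative
        # proportions, so the pronunciation itself is unchanged.
        wp["durations"] = [d * scale for d in wp["durations"]]
        t = target_end
    return word_predictions
```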
Step 11: perform acoustic feature prediction according to the linguistic features and the phoneme duration prediction results.
Step 12: generate speech.
Here, acoustic features can be predicted by combining the results predicted by the duration prediction model with the preceding linguistic features, and speech is synthesized on the basis of these results.
Step 13: synthesize the speech and the background music, and output the music.
Finally, the speech and the background music can be synthesized into a piece of music, so the final song can be generated.
In this embodiment of the present invention, synthesis quality refers to the quality of the speech output by the speech synthesis system, which is generally evaluated subjectively in terms of clarity (or intelligibility), naturalness, and continuity. Speech synthesis is improved into hip-hop music synthesis: because hip-hop differs from ordinary speech in having a rhythm, an adaptive adjustment is applied to the result of duration prediction, so that each word can land on the beat while its original pronunciation remains unchanged. Adding music to speech requires rhythm, which is why duration prediction is introduced; that is, the rhythms of the speech and the music are synthesized together, and only in this way can more rhythmic music be formed.
It should be noted that, for the foregoing method embodiments, for simplicity of description, the methods are described as a series of action combinations. However, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Secondly, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
To facilitate better implementation of the above solutions of the embodiments of the present invention, related devices for implementing the above solutions are also provided below.
Referring to Fig. 3-a, a device 300 for generating audio information provided by an embodiment of the present invention may include: an obtaining module 301, a linguistic analysis module 302, a duration prediction module 303, an audio generation module 304, and an audio fusion module 305, wherein
the obtaining module 301 is configured to obtain text information and first audio information, the text information comprising at least one word;
the linguistic analysis module 302 is configured to perform linguistic analysis on the text information to obtain a linguistic feature of each of the at least one word;
the duration prediction module 303 is configured to perform phoneme-level duration prediction and adaptive duration adjustment on each of the at least one word using a duration prediction model, to obtain a predicted phoneme duration of each of the at least one word;
the audio generation module 304 is configured to generate, according to the predicted phoneme duration of the at least one word and the corresponding linguistic feature, second audio information corresponding to the at least one word; and
the audio fusion module 305 is configured to synthesize the first audio information and the second audio information to obtain fused audio information.
In some embodiments of the present invention, referring to Fig. 3-b, the linguistic analysis module 302 comprises:
a sentence breaking module 3021, configured to break the text information into sentences to obtain at least one sub-text;
a segmentation module 3022, configured to segment each sub-text into words according to part of speech and prosody, to obtain the word(s) corresponding to each sub-text; and
a feature extraction module 3023, configured to extract a linguistic feature from the word(s) corresponding to each sub-text, to obtain the linguistic feature of each of the at least one word.
In some embodiments of the present invention, referring to Fig. 3-c, the audio generation module 304 comprises:
an acoustic prediction module 3041, configured to perform acoustic feature prediction on each of the at least one word according to the predicted phoneme duration of the at least one word and the corresponding linguistic feature, to obtain an acoustic feature corresponding to each of the at least one word;
a feature conversion module 3042, configured to convert the acoustic feature corresponding to the at least one word into a sound clip corresponding to the at least one word; and
a clip synthesis module 3043, configured to synthesize the sound clips corresponding to the at least one word together to obtain the second audio information.
In some embodiments of the present invention, referring to Fig. 3-d, the device 300 for generating audio information further comprises:
a rhythm matching module 306, configured to judge, according to the predicted phoneme durations of the second audio information, whether the rhythm of the second audio information matches that of the first audio information, and to trigger execution of the audio fusion module 305 if the rhythms of the second audio information and the first audio information match.
In some embodiments of the present invention, the rhythm matching module 306 is further configured to: if the rhythms of the second audio information and the first audio information do not match, perform rhythm matching between the predicted phoneme durations of the second audio information and the audio data in an audio database, to obtain audio data screened out of the audio database, where each piece of audio data in the audio database corresponds to a prosodic feature; and
the audio fusion module 305 is further configured to synthesize the generated second audio information with the audio data screened out of the audio database, to obtain fused audio information.
In some embodiments of the present invention, referring to Fig. 3-e, the device 300 for generating audio information further comprises:
a sample extraction module 307, configured to extract phoneme durations from the training samples in a training corpus;
a model training module 308, configured to use the extracted phoneme durations as input parameters of a neural network to train the duration prediction model on phoneme durations;
a model testing module 309, configured to test the duration prediction model on phoneme durations using the test samples in a test corpus after the training of the duration prediction model is completed; and
an output module 310, configured to output the duration prediction model after the test is completed.
As can be seen from the above description of this embodiment of the present invention, text information and first audio information are obtained, linguistic analysis is then performed on the text information, and a linguistic feature is obtained for each of the at least one word, the at least one word being obtained by segmenting the text. Phoneme-level duration prediction and adaptive duration adjustment are performed on each of the at least one word using the duration prediction model, to obtain the predicted phoneme duration of each word. Second audio information corresponding to the at least one word is generated according to the predicted phoneme durations and the corresponding linguistic features, and the second audio information and the first audio information are synthesized to obtain fused audio information. Once the text information is obtained, linguistic analysis can be performed on it; because the second audio information generated from the predicted phoneme durations and linguistic features has undergone duration prediction and adaptive duration adjustment by the duration prediction model, it adapts more easily to the rhythm of the first audio information, so that more rhythmic fused audio information can be formed. The fused audio information ultimately generated is closely associated with the obtained text information and first audio information, and fused audio in which the speech matches the rhythm can be generated by automatically processing the text information.
An embodiment of the present invention further provides another terminal. As shown in Fig. 4, for ease of description, only the parts related to this embodiment of the present invention are shown; for specific technical details not disclosed, please refer to the method part of the present invention. The terminal may be any terminal device including a mobile phone, a tablet computer, a PDA (Personal Digital Assistant), a POS (Point of Sales) terminal, an in-vehicle computer, and the like. The terminal being a mobile phone is taken as an example:
Fig. 4 is a block diagram of a partial structure of a mobile phone related to the terminal provided by an embodiment of the present invention. Referring to Fig. 4, the mobile phone includes components such as a radio frequency (RF) circuit 1010, a memory 1020, an input unit 1030, a display unit 1040, a sensor 1050, an audio circuit 1060, a wireless fidelity (WiFi) module 1070, a processor 1080, and a power supply 1090. Those skilled in the art will understand that the mobile phone structure shown in Fig. 4 does not constitute a limitation on the mobile phone, which may include more or fewer components than shown, or combine certain components, or have a different component arrangement.
The components of the mobile phone are introduced below with reference to Fig. 4:
The RF circuit 1010 may be used to receive and send signals during information transmission and reception or during a call. In particular, after receiving downlink information from a base station, the RF circuit delivers it to the processor 1080 for processing, and sends uplink data to the base station. Generally, the RF circuit 1010 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1010 may also communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), e-mail, Short Messaging Service (SMS), and the like.
The memory 1020 may be used to store software programs and modules. The processor 1080 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1020. The memory 1020 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function or an image playback function), and the like, and the data storage area may store data created according to the use of the mobile phone (such as audio data and a phone book). In addition, the memory 1020 may include a high-speed random access memory, and may also include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.
The input unit 1030 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 1030 may include a touch panel 1031 and other input devices 1032. The touch panel 1031, also called a touch screen, may collect touch operations of a user on or near it (such as operations performed by the user with a finger, a stylus, or any other suitable object or accessory on or near the touch panel 1031) and drive corresponding connection devices according to a preset program. Optionally, the touch panel 1031 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch orientation of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 1080, and can receive and execute commands sent by the processor 1080. In addition, the touch panel 1031 may be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 1031, the input unit 1030 may also include other input devices 1032. Specifically, the other input devices 1032 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, and a joystick.
The display unit 1040 may be used to display information input by the user or information provided to the user, as well as various menus of the mobile phone. The display unit 1040 may include a display panel 1041. Optionally, the display panel 1041 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 1031 may cover the display panel 1041. When the touch panel 1031 detects a touch operation on or near it, the operation is transmitted to the processor 1080 to determine the type of the touch event, and the processor 1080 then provides corresponding visual output on the display panel 1041 according to the type of the touch event. Although in Fig. 4 the touch panel 1031 and the display panel 1041 are shown as two independent components implementing the input and output functions of the mobile phone, in some embodiments, the touch panel 1031 and the display panel 1041 may be integrated to implement the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 1050, such as an optical sensor, a motion sensor, and other sensors. Specifically, the optical sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor may adjust the brightness of the display panel 1041 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 1041 and/or the backlight when the mobile phone is moved close to the ear. As a kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the posture of the mobile phone (such as landscape/portrait switching, related games, and magnetometer pose calibration) and vibration-recognition-related functions (such as a pedometer and tapping). Other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor may also be configured in the mobile phone, and details are not described herein.
The audio circuit 1060, a loudspeaker 1061, and a microphone 1062 can provide an audio interface between the user and the mobile phone. The audio circuit 1060 can transmit the electrical signal converted from received audio data to the loudspeaker 1061, which converts it into a sound signal for output; on the other hand, the microphone 1062 converts the collected sound signal into an electrical signal, which is received by the audio circuit 1060 and converted into audio data. After being processed by the processor 1080, the audio data is output and sent through the RF circuit 1010 to, for example, another mobile phone, or output to the memory 1020 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 1070, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and the like; it provides wireless broadband Internet access for the user. Although Fig. 4 shows the WiFi module 1070, it can be understood that it is not a necessary component of the mobile phone and may be omitted as needed within the scope not changing the essence of the invention.
The processor 1080 is the control center of the mobile phone. It connects the various parts of the entire mobile phone using various interfaces and lines, and performs the various functions of the mobile phone and processes data by running or executing the software programs and/or modules stored in the memory 1020 and calling the data stored in the memory 1020, thereby monitoring the mobile phone as a whole. Optionally, the processor 1080 may include one or more processing units; preferably, the processor 1080 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, the user interface, application programs, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may alternatively not be integrated into the processor 1080.
The mobile phone further includes a power supply 1090 (such as a battery) that supplies power to the components. Preferably, the power supply may be logically connected to the processor 1080 through a power management system, so that functions such as charging, discharging, and power consumption management are implemented through the power management system.
Although not shown, the mobile phone may also include a camera, a Bluetooth module, and the like, and details are not described herein.
In this embodiment of the present invention, the processor 1080 included in the terminal also controls the execution of the above method flow performed by the terminal.
In addition, it should be noted that the device embodiments described above are merely exemplary. The units described as separate parts may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. In addition, in the drawings of the device embodiments provided by the present invention, the connection relationships between modules indicate that they have communication connections, which may specifically be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement them without creative effort.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present invention may be implemented by software plus necessary general-purpose hardware, and certainly may also be implemented by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, dedicated components, and the like. In general, all functions completed by a computer program can easily be implemented with corresponding hardware, and the specific hardware structure used to implement the same function may also be diverse, such as an analog circuit, a digital circuit, or a dedicated circuit. However, in most cases, implementation by a software program is a better embodiment for the present invention. Based on this understanding, the part of the technical solutions of the present invention that essentially contributes to the prior art can be embodied in the form of a software product, which is stored in a readable storage medium such as a computer floppy disk, a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc, and includes several instructions for enabling a computer device (which may be a personal computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present invention.
In conclusion, the above embodiments are merely intended to describe the technical solutions of the present invention, rather than to limit them. Although the present invention has been described in detail with reference to the above embodiments, those skilled in the art should understand that they can still modify the technical solutions recorded in the above embodiments or make equivalent replacements of some technical features therein, and such modifications or replacements do not depart the essence of the corresponding technical solutions from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (13)

1. A method for generating audio information, characterized in that the method comprises:
obtaining text information and first audio information, the text information comprising at least one word;
performing linguistic analysis on the text information to respectively obtain a linguistic feature of the at least one word;
performing phoneme-level duration prediction and adaptive duration adjustment on the at least one word respectively by using a duration prediction model, to obtain a phoneme duration prediction value of the at least one word;
generating, according to the phoneme duration prediction value of the at least one word and the corresponding linguistic feature, second audio information corresponding to the at least one word; and
synthesizing the first audio information and the second audio information to obtain fused audio information.
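
Read as engineering rather than claim language, claim 1 describes a text-to-speech pipeline whose output is paced to suit an existing accompaniment before the two are mixed. The Python sketch below is a minimal illustration of that flow under invented assumptions: every name in it is hypothetical, and the stub bodies (a fixed 120 ms per phoneme, silence in place of a vocoder) stand in for the trained models the claim presupposes.

```python
from dataclasses import dataclass, field

SAMPLE_RATE = 16000

@dataclass
class Word:
    text: str
    features: dict = field(default_factory=dict)   # linguistic features
    durations: list = field(default_factory=list)  # per-phoneme durations (s)

def linguistic_analysis(text: str) -> list[Word]:
    # Stub for steps 1-2: segment the text and attach per-word features.
    return [Word(w, {"phones": list(w)}) for w in text.split()]

def predict_and_adapt_durations(word: Word, beat_period: float) -> list[float]:
    # Stub duration model (step 3): a flat 120 ms per phoneme, then
    # adaptively rescaled so the word spans a whole number of beats.
    raw = [0.12] * len(word.features["phones"])
    total = sum(raw)
    target = max(1, round(total / beat_period)) * beat_period
    return [d * target / total for d in raw]

def synthesize(words: list[Word]) -> list[float]:
    # Stub for step 4: emit silence of the predicted total length in place
    # of a real acoustic model and vocoder.
    n = int(sum(sum(w.durations) for w in words) * SAMPLE_RATE)
    return [0.0] * n

def mix(first: list[float], second: list[float]) -> list[float]:
    # Step 5: overlay the synthesized voice on the accompaniment.
    n = max(len(first), len(second))
    pad = lambda s: s + [0.0] * (n - len(s))
    return [a + b for a, b in zip(pad(first), pad(second))]

def generate_fused_audio(text: str, first_audio: list[float],
                         beat_period: float = 0.5) -> list[float]:
    words = linguistic_analysis(text)                     # steps 1-2
    for w in words:
        w.durations = predict_and_adapt_durations(w, beat_period)  # step 3
    return mix(first_audio, synthesize(words))            # steps 4-5
```

The adaptive adjustment here simply stretches each word to the nearest whole number of beats; the patent leaves the concrete adjustment policy to the duration prediction model.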
2. The method according to claim 1, characterized in that the performing linguistic analysis on the text information to respectively obtain the linguistic feature of the at least one word comprises:
performing sentence segmentation on the text information to obtain at least one sub-text;
performing word segmentation on each sub-text according to part of speech and prosody, to obtain a word corresponding to each sub-text; and
extracting a linguistic feature from the word corresponding to each sub-text respectively, to obtain the linguistic feature of the at least one word.
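
Claim 2 splits the analysis into three passes: sentence segmentation, word segmentation guided by part of speech and prosody, and per-word feature extraction. A toy sketch of those passes follows; the punctuation-based splitter, the whitespace "segmenter", and the feature names are all placeholder assumptions (for Chinese text a real system would use a proper segmenter such as jieba), not the patent's actual components.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Sentence segmentation: break on terminal punctuation.
    return [s for s in re.split(r"[。！？.!?]+", text) if s.strip()]

def segment_words(sentence: str) -> list[str]:
    # Placeholder word segmentation; a real system would split according
    # to part of speech and prosodic word boundaries.
    return sentence.split()

def extract_features(word: str, idx: int, n: int) -> dict:
    # Toy linguistic features: identity, position in sentence, length.
    return {"word": word, "position": idx / max(n - 1, 1), "length": len(word)}

def linguistic_analysis(text: str) -> list[dict]:
    features = []
    for sentence in split_sentences(text):
        words = segment_words(sentence)
        features += [extract_features(w, i, len(words))
                     for i, w in enumerate(words)]
    return features

print(linguistic_analysis("keep the flow steady. drop the beat!"))
```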
3. The method according to claim 1, characterized in that the generating, according to the phoneme duration prediction value of the at least one word and the corresponding linguistic feature, second audio information corresponding to the at least one word comprises:
performing acoustic feature prediction on the at least one word respectively according to the phoneme duration prediction value of the at least one word and the corresponding linguistic feature, to obtain an acoustic feature corresponding to the at least one word;
converting the acoustic feature corresponding to the at least one word into a sound clip corresponding to the at least one word; and
synthesizing the sound clips corresponding to the at least one word together to obtain the second audio information.
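
Claim 3 follows the usual parametric-synthesis route: predict frame-level acoustic features from the linguistic features and predicted durations, vocode each word into a sound clip, then concatenate the clips in order. The NumPy sketch below is schematic only; the declining-F0 "acoustic model" and sine-wave "vocoder" are invented stand-ins for the trained components the claim assumes.

```python
import numpy as np

SAMPLE_RATE = 16000

def predict_acoustics(duration_s: float, base_f0: float) -> np.ndarray:
    # Placeholder acoustic model: one F0 value per 5 ms frame, with a
    # gentle declination instead of a learned contour.
    n_frames = int(duration_s / 0.005)
    return base_f0 * np.linspace(1.0, 0.9, max(n_frames, 1))

def features_to_clip(f0_frames: np.ndarray) -> np.ndarray:
    # Placeholder vocoder: render each 5 ms frame as a sine at its F0.
    clips = []
    for f0 in f0_frames:
        t = np.arange(int(0.005 * SAMPLE_RATE)) / SAMPLE_RATE
        clips.append(np.sin(2 * np.pi * f0 * t))
    return np.concatenate(clips)

def second_audio(word_durations: list[float]) -> np.ndarray:
    # Per-word clips are synthesized independently, then joined in order.
    clips = [features_to_clip(predict_acoustics(d, 200.0))
             for d in word_durations]
    return np.concatenate(clips)

audio = second_audio([0.30, 0.24, 0.36])  # three words, durations in seconds
print(audio.shape)
```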
4. The method according to claim 1, characterized in that after the generating, according to the phoneme duration prediction value of the at least one word and the corresponding linguistic feature, second audio information corresponding to the at least one word, the method further comprises:
judging, according to the phoneme duration prediction value of the second audio information, whether rhythm matching is satisfied between the second audio information and the first audio information; and
if rhythm matching is satisfied between the second audio information and the first audio information, triggering execution of the following step: synthesizing the first audio information and the second audio information to obtain the fused audio information.
5. The method according to claim 4, characterized in that the method further comprises:
if rhythm matching is not satisfied between the second audio information and the first audio information, performing rhythm matching between the phoneme duration prediction value of the second audio information and audio data in an audio database, to select audio data from the audio database, each piece of audio data in the audio database corresponding to a prosodic feature; and
synthesizing the generated second audio information with the audio data selected from the audio database, to obtain the fused audio information.
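
Claims 4 and 5 together act as a quality gate: fuse directly when the predicted durations of the second audio line up with the first audio's rhythm, and otherwise fall back to prosodically matched material retrieved from an audio database. One hedged reading follows; the beat-alignment score and the in-memory database are made-up proxies for whatever matching criterion and corpus the patent actually uses.

```python
def beat_alignment(durations: list[float], beat_period: float) -> float:
    # Fraction of words whose predicted length is close to a whole number
    # of beats (an invented proxy for "rhythm matching").
    def off_beat(d: float) -> float:
        phase = (d / beat_period) % 1.0
        return min(phase, 1.0 - phase)
    return sum(off_beat(d) < 0.1 for d in durations) / len(durations)

# Toy database: each entry pairs audio (here, just an id) with a prosodic
# feature, modeled as the entry's average duration per word.
AUDIO_DB = [{"id": "clip_a", "avg_dur": 0.25},
            {"id": "clip_b", "avg_dur": 0.50},
            {"id": "clip_c", "avg_dur": 0.75}]

def select_from_database(durations: list[float]) -> dict:
    # Pick the entry whose prosodic feature best matches the predictions.
    avg = sum(durations) / len(durations)
    return min(AUDIO_DB, key=lambda e: abs(e["avg_dur"] - avg))

def fuse(durations: list[float], beat_period: float = 0.5) -> str:
    if beat_alignment(durations, beat_period) >= 0.8:
        return "mix second audio with first audio"        # claim 4 path
    chosen = select_from_database(durations)              # claim 5 path
    return f"mix second audio with database {chosen['id']}"

print(fuse([0.50, 0.48, 1.02]))   # well aligned: direct fusion
print(fuse([0.31, 0.77, 0.66]))   # poorly aligned: database fallback
```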
6. The method according to any one of claims 1 to 5, characterized in that the method further comprises:
extracting phoneme durations from training samples in a training corpus;
using the extracted phoneme durations as input parameters of a neural network, training the duration prediction model on phoneme duration;
after training of the duration prediction model is completed, performing a phoneme duration test on the duration prediction model by using test samples in a test corpus; and
outputting the duration prediction model after the test is completed.
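
Claim 6 covers the offline life cycle of the duration prediction model: extract phoneme durations from a training corpus, train a neural network with them, test on a separate test corpus, and output the tested model. The sketch below substitutes a one-layer network fit by gradient descent on synthetic corpora; it reads the claim's "input parameters" as the extracted durations serving as regression targets paired with linguistic feature vectors, which is one plausible interpretation rather than the patent's stated design.

```python
import numpy as np

rng = np.random.default_rng(0)

def extract_corpus(n: int):
    # Stand-in for corpus extraction: linguistic feature vectors paired
    # with the phoneme durations measured from aligned recordings.
    X = rng.normal(size=(n, 4))                 # 4 toy linguistic features
    w_true = np.array([0.05, -0.02, 0.03, 0.01])
    y = 0.12 + X @ w_true + rng.normal(0, 0.005, n)  # durations in seconds
    return X, y

def train_duration_model(X, y, lr=0.1, epochs=500):
    # Minimal "network": a single linear layer fit by gradient descent.
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        err = X @ w + b - y
        w -= lr * X.T @ err / len(y)
        b -= lr * err.mean()
    return w, b

def test_duration_model(model, X, y) -> float:
    w, b = model
    return float(np.mean(np.abs(X @ w + b - y)))  # mean absolute error (s)

X_train, y_train = extract_corpus(2000)           # training corpus
X_test, y_test = extract_corpus(500)              # test corpus
model = train_duration_model(X_train, y_train)
print("test MAE:", test_duration_model(model, X_test, y_test))
# Output step: persist the tested model, e.g. np.savez("duration_model.npz",
# w=model[0], b=model[1]); the file name is illustrative only.
```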
7. An apparatus for generating audio information, characterized in that the apparatus comprises:
an obtaining module, configured to obtain text information and first audio information, the text information comprising at least one word;
a linguistic analysis module, configured to perform linguistic analysis on the text information to respectively obtain a linguistic feature of the at least one word;
a duration prediction module, configured to perform phoneme-level duration prediction and adaptive duration adjustment on the at least one word respectively by using a duration prediction model, to obtain a phoneme duration prediction value of the at least one word;
an audio generation module, configured to generate, according to the phoneme duration prediction value of the at least one word and the corresponding linguistic feature, second audio information corresponding to the at least one word; and
an audio fusion module, configured to synthesize the first audio information and the second audio information to obtain fused audio information.
8. The apparatus according to claim 7, characterized in that the linguistic analysis module comprises:
a sentence segmentation module, configured to perform sentence segmentation on the text information to obtain at least one sub-text;
a word segmentation module, configured to perform word segmentation on each sub-text according to part of speech and prosody, to obtain a word corresponding to each sub-text; and
a feature extraction module, configured to extract a linguistic feature from the word corresponding to each sub-text respectively, to obtain the linguistic feature of the at least one word.
9. The apparatus according to claim 7, characterized in that the audio generation module comprises:
an acoustic prediction module, configured to perform acoustic feature prediction on the at least one word respectively according to the phoneme duration prediction value of the at least one word and the corresponding linguistic feature, to obtain an acoustic feature corresponding to the at least one word;
a feature conversion module, configured to convert the acoustic feature corresponding to the at least one word into a sound clip corresponding to the at least one word; and
a clip synthesis module, configured to synthesize the sound clips corresponding to the at least one word together to obtain the second audio information.
10. The apparatus according to claim 7, characterized in that the apparatus for generating audio information further comprises:
a rhythm matching module, configured to judge, according to the phoneme duration prediction value of the second audio information, whether rhythm matching is satisfied between the second audio information and the first audio information, and, if rhythm matching is satisfied between the second audio information and the first audio information, to trigger execution of the audio fusion module.
11. The apparatus according to claim 10, characterized in that the rhythm matching module is further configured to: if rhythm matching is not satisfied between the second audio information and the first audio information, perform rhythm matching between the phoneme duration prediction value of the second audio information and audio data in an audio database, to obtain audio data selected from the audio database, each piece of audio data in the audio database corresponding to a prosodic feature; and
the audio fusion module is further configured to synthesize the generated second audio information with the audio data selected from the audio database, to obtain the fused audio information.
12. The apparatus according to any one of claims 7 to 11, characterized in that the apparatus for generating audio information further comprises:
a sample extraction module, configured to extract phoneme durations from training samples in a training corpus;
a model training module, configured to use the extracted phoneme durations as input parameters of a neural network to train the duration prediction model on phoneme duration;
a model testing module, configured to perform a phoneme duration test on the duration prediction model by using test samples in a test corpus after training of the duration prediction model is completed; and
an output module, configured to output the duration prediction model after the test is completed.
13. A computer-readable storage medium comprising instructions which, when run on a computer, cause the computer to perform the method according to any one of claims 1 to 6.
CN201711137172.3A 2017-11-16 2017-11-16 Audio information generation method and device Active CN109801618B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711137172.3A CN109801618B (en) 2017-11-16 2017-11-16 Audio information generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711137172.3A CN109801618B (en) 2017-11-16 2017-11-16 Audio information generation method and device

Publications (2)

Publication Number Publication Date
CN109801618A true CN109801618A (en) 2019-05-24
CN109801618B CN109801618B (en) 2022-09-13

Family

ID=66555411

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711137172.3A Active CN109801618B (en) 2017-11-16 2017-11-16 Audio information generation method and device

Country Status (1)

Country Link
CN (1) CN109801618B (en)


Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB9412555D0 (en) * 1994-06-22 1994-08-10 Ibm A speech synthesis system
US6173263B1 (en) * 1998-08-31 2001-01-09 At&T Corp. Method and system for performing concatenative speech synthesis using half-phonemes
CN101452699A (en) * 2007-12-04 2009-06-10 株式会社东芝 Rhythm self-adapting and speech synthesizing method and apparatus
CN101694772A (en) * 2009-10-21 2010-04-14 北京中星微电子有限公司 Method for converting text into rap music and device thereof
CN102203853A (en) * 2010-01-04 2011-09-28 株式会社东芝 Method and apparatus for synthesizing a speech with information
CN101901598A (en) * 2010-06-30 2010-12-01 北京捷通华声语音技术有限公司 Humming synthesis method and system
CN103854643A (en) * 2012-11-29 2014-06-11 株式会社东芝 Method and apparatus for speech synthesis
CN104391980A (en) * 2014-12-08 2015-03-04 百度在线网络技术(北京)有限公司 Song generating method and device
CN105788589A (en) * 2016-05-04 2016-07-20 腾讯科技(深圳)有限公司 Audio data processing method and device
CN106652995A (en) * 2016-12-31 2017-05-10 深圳市优必选科技有限公司 Voice broadcasting method and system for text

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110264993A (en) * 2019-06-27 2019-09-20 百度在线网络技术(北京)有限公司 Phoneme synthesizing method, device, equipment and computer readable storage medium
CN110264993B (en) * 2019-06-27 2020-10-09 百度在线网络技术(北京)有限公司 Speech synthesis method, device, equipment and computer readable storage medium
CN113129863A (en) * 2019-12-31 2021-07-16 科大讯飞股份有限公司 Voice time length prediction method, device, equipment and readable storage medium
CN111445897A (en) * 2020-03-23 2020-07-24 北京字节跳动网络技术有限公司 Song generation method and device, readable medium and electronic equipment
CN111627422A (en) * 2020-05-13 2020-09-04 广州国音智能科技有限公司 Voice acceleration detection method, device and equipment and readable storage medium
CN112331219A (en) * 2020-11-05 2021-02-05 北京爱数智慧科技有限公司 Voice processing method and device
CN112331219B (en) * 2020-11-05 2024-05-03 北京晴数智慧科技有限公司 Voice processing method and device
CN113674731A (en) * 2021-05-14 2021-11-19 北京搜狗科技发展有限公司 Speech synthesis processing method, apparatus and medium
CN113380222A (en) * 2021-06-09 2021-09-10 广州虎牙科技有限公司 Speech synthesis method, speech synthesis device, electronic equipment and storage medium
CN114639371A (en) * 2022-03-16 2022-06-17 马上消费金融股份有限公司 Voice conversion method, device and equipment
CN114639371B (en) * 2022-03-16 2023-08-01 马上消费金融股份有限公司 Voice conversion method, device and equipment

Also Published As

Publication number Publication date
CN109801618B (en) 2022-09-13


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant