CN110444191A - Method for prosody hierarchy annotation, and method and device for model training - Google Patents
Method for prosody hierarchy annotation, and method and device for model training - Download PDF
- Publication number
- CN110444191A (application number CN201910751371.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- trained
- text
- prosody hierarchy
- audio data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Abstract
This application discloses a method for prosody hierarchy annotation. The method is applied to the field of artificial intelligence, and in particular to speech synthesis. The method includes: obtaining text data to be annotated and audio data, where there is a correspondence between the text data to be annotated and the audio data; extracting a text feature set of each word from the text data to be annotated; extracting an acoustic feature set of each word from the audio data; and obtaining a prosody hierarchy structure through a prosody hierarchy annotation model according to the word identifier, the text feature set, and the acoustic feature set of each word. Also disclosed are a method for model training, a prosody hierarchy annotation apparatus, and a model training apparatus. By jointly using text features and acoustic features to build the prosody hierarchy annotation model, this application provides richer features for prosody hierarchy annotation, improves the accuracy of prosody hierarchy annotation, and enhances the quality of speech synthesis.
Description
This application is a divisional application of Chinese patent application No. 201910060152.3, filed with the China National Intellectual Property Office on January 22, 2019 and entitled "Method for prosody hierarchy annotation, and method and device for model training".
Technical field
This application relates to the field of intelligent speech synthesis, and in particular to a method for prosody hierarchy annotation, a method for model training, and a related apparatus.
Background technique
To build a high-quality speech synthesis system, a massive amount of data accurately annotated with prosody hierarchy structure is crucial. The prosody hierarchy structure models the rhythm of speech and its pauses, so a method capable of accurately and automatically annotating the prosody hierarchy structure is of great importance for rapidly building a speech synthesis corpus and improving the naturalness of synthesized speech.
At present, automatic annotation of prosody hierarchy structure requires training an automatic annotation model with machine learning methods, and there are two main approaches to feature selection. The first uses text features: the text is first segmented into words, the text features of each word are extracted, and the prosody hierarchy type of each word is decided by a machine learning method. The second uses acoustic features: the pause positions in the audio are detected, and different prosody hierarchy types are distinguished by the length of the pauses.
However, annotating with text data alone ignores phenomena that occur in practice, such as the lengthening of the syllable preceding a prosody hierarchy boundary and the short pauses that often occur at intonation phrase boundaries. Annotating with acoustic features alone makes it difficult to accurately label all three levels of the prosody hierarchy at once, and ignores the intrinsic connection between text features and acoustic features. Both approaches reduce the quality of prosody hierarchy annotation and thus the quality of the corpus on which speech synthesis relies.
Summary of the invention
The embodiments of this application provide a method for prosody hierarchy annotation, a method for model training, and a device. By jointly using text features and acoustic features to build a prosody hierarchy annotation model, richer features can be provided for prosody hierarchy annotation; a more accurate annotation model improves the accuracy of prosody hierarchy annotation and enhances the quality of speech synthesis.
In view of this, a first aspect of this application provides a method for prosody hierarchy annotation, including:
obtaining text data to be annotated and audio data, where there is a correspondence between the text data to be annotated and the audio data, the text data to be annotated includes at least one word, and each word corresponds to a word identifier;
extracting a text feature set of each word from the text data to be annotated, where the text feature set includes part of speech, word length, and post-word punctuation type;
extracting an acoustic feature set of each word from the audio data, where the acoustic feature set includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
obtaining a prosody hierarchy structure through a prosody hierarchy annotation model according to the word identifier, the text feature set, and the acoustic feature set of each word.
A second aspect of this application provides a method for model training, including:
obtaining text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word from the text data to be trained, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type;
extracting an acoustic feature set to be trained of each word from the audio data to be trained, where the acoustic feature set to be trained includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
training on the word identifier, the text feature set to be trained, and the acoustic feature set to be trained of each word, to obtain a prosody hierarchy annotation model, where the prosody hierarchy annotation model is used to annotate the prosody hierarchy structure.
A third aspect of this application provides a prosody hierarchy annotation apparatus, including:
an obtaining module, configured to obtain text data to be annotated and audio data, where there is a correspondence between the text data to be annotated and the audio data, the text data to be annotated includes at least one word, and each word corresponds to a word identifier;
an extraction module, configured to extract a text feature set of each word from the text data to be annotated obtained by the obtaining module, where the text feature set includes part of speech, word length, and post-word punctuation type; the extraction module is further configured to extract an acoustic feature set of each word from the audio data obtained by the obtaining module, where the acoustic feature set includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
a prediction module, configured to obtain a prosody hierarchy structure through a prosody hierarchy annotation model according to the word identifier of each word and the text feature set and acoustic feature set of each word extracted by the extraction module.
In a possible design, in a first implementation of the third aspect of the embodiments of this application, the prediction module is specifically configured to determine at least one of a prosodic word, a prosodic phrase, and an intonation phrase through the prosody hierarchy annotation model; or to determine a prosodic word and/or a prosodic phrase through the prosody hierarchy annotation model.
A fourth aspect of this application provides a model training apparatus, including:
an obtaining module, configured to obtain text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
an extraction module, configured to extract a text feature set to be trained of each word from the text data to be trained obtained by the obtaining module, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type; the extraction module is further configured to extract an acoustic feature set to be trained of each word from the audio data to be trained obtained by the obtaining module, where the acoustic feature set to be trained includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
a training module, configured to train on the word identifier, the text feature set to be trained, and the acoustic feature set to be trained of each word extracted by the extraction module, to obtain the prosody hierarchy annotation model, where the prosody hierarchy annotation model is used to annotate the prosody hierarchy structure.
In a possible design, in a first implementation of the fourth aspect of the embodiments of this application, the model training apparatus further includes a processing module and a generation module;
the processing module is configured to perform word segmentation on the text data to be trained after the obtaining module obtains the text data to be trained and the audio data to be trained, to obtain at least one word;
the obtaining module is further configured to obtain, according to a preset word identifier relationship, a target word identifier corresponding to a target word, where the preset word identifier relationship indicates the relationship between each preset word and its word identifier, and the target word is any one of the at least one word obtained by the processing module;
the generation module is configured to generate a target word vector corresponding to the target word in the text data to be trained; and
the training module is specifically configured to train on the target word identifier obtained by the obtaining module and the target word vector generated by the generation module, to obtain a first model parameter, where the first model parameter is used to generate the word embedding layer in the prosody hierarchy annotation model.
In a possible design, in a second implementation of the fourth aspect of the embodiments of this application,
the extraction module is specifically configured to obtain the part of speech, word length, and post-word punctuation type of the target word in the text data to be trained, where the part of speech indicates the syntactic category of the word, the word length indicates the number of characters in the word, and the post-word punctuation type indicates the type of punctuation that follows the word; and to obtain the part of speech, word length, and post-word punctuation type of a related word in the text data to be trained, where the related word is a word having an association with the target word; and
the training module is specifically configured to train on the part of speech, word length, and post-word punctuation type of the target word and of the related word, to obtain a second model parameter, where the second model parameter is used to generate the text neural network in the prosody hierarchy annotation model.
In a possible design, in a third implementation of the fourth aspect of the embodiments of this application, the model training apparatus further includes an alignment module;
the alignment module is configured to perform forced alignment on the text data to be trained and the audio data to be trained after the obtaining module obtains them, to obtain time-aligned text; and
the extraction module is specifically configured to determine the final-syllable duration of the target word according to the time-aligned text.
In a possible design, in a fourth implementation of the fourth aspect of the embodiments of this application, the extraction module is specifically configured to determine the post-word pause duration of the target word according to the time-aligned text.
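Given time-aligned text from forced alignment, the two duration features above reduce to simple interval arithmetic. The following is a hedged sketch: the `(start, end)` interval representation in seconds is an assumption, since the patent does not specify the alignment output format.

```python
def final_syllable_duration(syllable_intervals):
    """Final-syllable duration of one word.

    syllable_intervals -- list of (start, end) times, in seconds, for the
                          word's syllables as produced by forced alignment.
    """
    start, end = syllable_intervals[-1]  # last syllable of the word
    return end - start

def pause_after(word_end, next_word_start):
    """Post-word pause duration: silence between a word's end time and the
    next word's start time; 0.0 when the words are contiguous."""
    return max(0.0, next_word_start - word_end)
```

With a real aligner one would read these times from its output (e.g. a TextGrid-style interval tier) rather than construct them by hand.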
In a possible design, in a fifth implementation of the fourth aspect of the embodiments of this application,
the extraction module is specifically configured to: compute the frame numbers of the voiced start frame and the voiced end frame of the final syllable of the target word according to the time-aligned text and the fundamental frequency (F0) information extracted from the audio data to be trained; extract the log-F0 curve and the log-energy curve of the audio data to be trained; and compute the final-syllable acoustic statistics of the target word according to the frame numbers of the voiced start frame and voiced end frame, the log-F0 curve, and the log-energy curve, where the final-syllable acoustic statistics include at least one of the maximum, minimum, range, mean, and variance of the log-F0 curve, and further include at least one of the maximum, minimum, range, mean, and variance of the log-energy curve.
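The statistics listed above are standard descriptive measures over the voiced span of the final syllable. A minimal sketch, assuming per-frame curves stored as plain Python lists and inclusive frame indices (both assumptions, not stated in the patent):

```python
def span_statistics(curve, voiced_start, voiced_end):
    """Max, min, range, mean and (population) variance of a per-frame curve
    (e.g. log-F0 or log-energy) over frames voiced_start..voiced_end inclusive.
    """
    seg = curve[voiced_start:voiced_end + 1]
    mean = sum(seg) / len(seg)
    var = sum((x - mean) ** 2 for x in seg) / len(seg)
    return {
        "max": max(seg),
        "min": min(seg),
        "range": max(seg) - min(seg),
        "mean": mean,
        "var": var,
    }
```

The same function would be applied once to the log-F0 curve and once to the log-energy curve, yielding up to ten scalar features per final syllable.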
In a possible design, in a sixth implementation of the fourth aspect of the embodiments of this application,
the extraction module is specifically configured to: compute, according to the time-aligned text and the F0 information extracted from the audio data to be trained, the frame number of the last voiced frame of the target word and the frame number of the first voiced frame of the next adjacent word; determine, according to these frame numbers and the per-frame F0 and energy information extracted from the audio data to be trained, the F0 values and energy values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word; and compute the log difference of the F0 values and the log difference of the energy values according to these two frames, where the log difference of the F0 values and the log difference of the energy values belong to the inter-word acoustic feature change values.
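The inter-word change feature just described can be sketched as below. Interpreting "log difference" as `log(a) - log(b)` across the word boundary is an assumption; the patent does not give the exact formula, and the frame indices would come from the voiced-frame detection described above.

```python
import math

def interword_log_deltas(f0, energy, last_voiced, next_first_voiced):
    """Log differences of F0 and energy across a word boundary.

    f0, energy        -- per-frame values extracted from the audio
    last_voiced       -- index of the target word's final voiced frame
    next_first_voiced -- index of the next word's first voiced frame
    """
    d_f0 = math.log(f0[next_first_voiced]) - math.log(f0[last_voiced])
    d_en = math.log(energy[next_first_voiced]) - math.log(energy[last_voiced])
    return d_f0, d_en
```

A large negative `d_f0` or `d_en` across the boundary would suggest a pitch/energy reset, which is one cue the annotation model can exploit for boundary type.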
In a possible design, in a seventh implementation of the fourth aspect of the embodiments of this application,
the training module is specifically configured to: obtain a first output result of the target word identifier through the word embedding layer in the prosody hierarchy annotation model, where the target word identifier corresponds to the target word, the target word is any one of the at least one word, and the word embedding layer is obtained by training according to the first model parameter; obtain a second output result of the target text feature set to be trained through the text neural network in the prosody hierarchy annotation model, where the target text feature set to be trained corresponds to the target word, and the text neural network is obtained by training according to the second model parameter; train on the first output result, the second output result, and the target acoustic feature set to be trained, to obtain a third model parameter, where the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used to generate the acoustic neural network in the prosody hierarchy annotation model; and generate the prosody hierarchy annotation model according to the first model parameter, the second model parameter, and the third model parameter.
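Structurally, the model described above combines three trained components: a word embedding layer, a text neural network, and an acoustic neural network. The following class is a minimal structural sketch only (not the patent's actual network or training procedure); the component callables and the concatenation-then-classify combination are illustrative assumptions.

```python
class ProsodyAnnotator:
    """Three-part prosody model: embedding + text net + acoustic net."""

    def __init__(self, embeddings, text_net, acoustic_net, classifier):
        self.embeddings = embeddings      # word id -> vector (1st model params)
        self.text_net = text_net          # text features -> vector (2nd params)
        self.acoustic_net = acoustic_net  # acoustic features -> vector (3rd params)
        self.classifier = classifier      # combined vector -> prosody label

    def predict(self, word_id, text_feats, acoustic_feats):
        # Concatenate the three component outputs (lists here, so `+` joins
        # them) and let the classifier decide the prosody hierarchy label.
        combined = (self.embeddings[word_id]
                    + self.text_net(text_feats)
                    + self.acoustic_net(acoustic_feats))
        return self.classifier(combined)
```

In practice each component would be a trained network (e.g. the text and acoustic networks producing hidden-state vectors) and the classifier a softmax layer over boundary types; here simple callables stand in for them.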
A fifth aspect of this application provides a prosody hierarchy annotation apparatus, including a memory, a transceiver, a processor, and a bus system, where the memory is configured to store a program, and the processor is configured to execute the program in the memory to perform the following steps:
obtaining text data to be annotated and audio data, where there is a correspondence between the text data to be annotated and the audio data, the text data to be annotated includes at least one word, and each word corresponds to a word identifier;
extracting a text feature set of each word from the text data to be annotated, where the text feature set includes part of speech, word length, and post-word punctuation type;
extracting an acoustic feature set of each word from the audio data, where the acoustic feature set includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
obtaining a prosody hierarchy structure through a prosody hierarchy annotation model according to the word identifier, the text feature set, and the acoustic feature set of each word;
the bus system is configured to connect the memory and the processor so that the memory and the processor communicate.
A sixth aspect of this application provides a model training apparatus, including a memory, a transceiver, a processor, and a bus system, where the memory is configured to store a program, and the processor is configured to execute the program in the memory to perform the following steps:
obtaining text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word from the text data to be trained, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type;
extracting an acoustic feature set to be trained of each word from the audio data to be trained, where the acoustic feature set to be trained includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
training on the word identifier, the text feature set to be trained, and the acoustic feature set to be trained of each word, to obtain a prosody hierarchy annotation model, where the prosody hierarchy annotation model is used to annotate the prosody hierarchy structure;
the bus system is configured to connect the memory and the processor so that the memory and the processor communicate.
A seventh aspect of this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the methods described in the above aspects.
As can be seen from the above technical solutions, the embodiments of this application have the following advantages:
The embodiments of this application provide a method for prosody hierarchy annotation. First, text data to be annotated and audio data are obtained, where there is a correspondence between them, the text data to be annotated includes at least one word, and each word corresponds to a word identifier. A text feature set of each word is then extracted from the text data to be annotated, including part of speech, word length, and post-word punctuation type. Next, an acoustic feature set of each word is extracted from the audio data, including final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values. Finally, a prosody hierarchy structure is obtained through a prosody hierarchy annotation model according to the word identifier, the text feature set, and the acoustic feature set of each word. In this way, text features and acoustic features are jointly used to build the prosody hierarchy annotation model, providing richer features for prosody hierarchy annotation; a more accurate annotation model improves the accuracy of prosody hierarchy annotation, which helps improve the naturalness of synthesized speech.
Detailed description of the invention
Fig. 1 is an architecture diagram of a speech synthesis system in the embodiments of this application;
Fig. 2 is a schematic structural diagram of the prosody hierarchy in the embodiments of this application;
Fig. 3 is a schematic diagram of an embodiment of the method for prosody hierarchy annotation in the embodiments of this application;
Fig. 4 is a schematic diagram of an application of the prosody hierarchy annotation system in the embodiments of this application;
Fig. 5 is a schematic flowchart of prosody hierarchy annotation in the embodiments of this application;
Fig. 6 is a schematic diagram of an embodiment of the method for model training in the embodiments of this application;
Fig. 7 is a schematic flowchart of acoustic feature set extraction in the embodiments of this application;
Fig. 8 is a schematic diagram of an embodiment of an F0 curve in the embodiments of this application;
Fig. 9 is a schematic diagram of an embodiment of an energy curve in the embodiments of this application;
Fig. 10 is a schematic structural diagram of the prosody hierarchy annotation model in the embodiments of this application;
Fig. 11 is a schematic diagram of an embodiment of the prosody hierarchy annotation apparatus in the embodiments of this application;
Fig. 12 is a schematic diagram of an embodiment of the model training apparatus in the embodiments of this application;
Fig. 13 is a schematic diagram of another embodiment of the model training apparatus in the embodiments of this application;
Fig. 14 is a schematic diagram of another embodiment of the model training apparatus in the embodiments of this application;
Fig. 15 is a schematic structural diagram of a terminal device in the embodiments of this application;
Fig. 16 is a schematic structural diagram of a server in the embodiments of this application.
Detailed description of embodiments
The embodiments of this application provide a method for prosody hierarchy annotation, a method for model training, and a device. By jointly using text features and acoustic features to build a prosody hierarchy annotation model, richer features can be provided for prosody hierarchy annotation; a more accurate annotation model improves the accuracy of prosody hierarchy annotation and enhances the quality of speech synthesis.
The terms "first", "second", "third", "fourth", and the like (if any) in the description, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "correspond to" and any variants of them are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or are inherent to the process, method, product, or device.
It should be understood that this application is mainly applied to the data preparation process of building a speech synthesis corpus, namely the automatic labeling of prosody hierarchy structure on text data. Speech synthesis is the task of converting text into speech, and building a high-quality speech synthesis system requires preparing massive amounts of data; among them, data labeled with prosody hierarchy structure has a great influence on the naturalness of synthesized speech. Traditional manual labeling is time-consuming and laborious for massive data, and different annotators can be inconsistent in how they label certain words. A system that automatically labels prosody hierarchy structure is therefore of great significance both for the massive prosody labeling task of quickly building a speech synthesis system and for resolving the inconsistency between different annotators.
The key technologies of speech technology (Speech Technology) are automatic speech recognition (ASR), text-to-speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, in which speech is expected to become one of the most promising modes of interaction.
For ease of understanding, this application proposes a method of prosody hierarchy labeling and a method of model training, applied to the speech synthesis system shown in Fig. 1. Referring to Fig. 1, which is an architecture diagram of the speech synthesis system in the embodiment of the present application, a terminal device or a server first obtains text data and audio data, where the text data and the audio data correspond to each other; for example, the text data is "today is an auspicious day" and the audio data is the audio of "today is an auspicious day". A forced alignment tool is used to align the text data with the audio data. Next, the text feature set corresponding to each word in the text data is extracted; the text feature set of each word includes the part of speech, the word length and the punctuation type after the word. At the same time, feature extraction is also performed on the audio data to obtain the acoustic feature set of each word, which includes the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable, and the inter-word acoustic feature change values, where the inter-word change values are the log difference in fundamental frequency, and the log difference in energy, between the last voiced frame of the current word and the first voiced frame of the next word. In addition, a term identification (ID) of each word can be extracted from the text data. The term identification of each word in the sentence, the text feature set of each word and the acoustic feature set of each word are input into the trained prosody hierarchy labeling model, which outputs the prosody hierarchy labeling result. If the prosody hierarchy labeling model is deployed on the terminal device, the terminal device can play the corresponding sentence directly according to the prosody hierarchy structure it obtains from the model. If the model is deployed on the server, the server obtains the prosody hierarchy structure from the model and then feeds it back to the terminal device, which plays the corresponding sentence according to the prosody hierarchy structure.
It should be noted that terminal devices include but are not limited to tablet computers, laptops, palmtop computers, mobile phones, voice interaction devices and personal computers (PC), without limitation here. Voice interaction devices include but are not limited to smart speakers and smart appliances. Voice interaction devices also have the following characteristics:
1. Networking: voice interaction devices can be connected together through a local area network, connected to the manufacturer's service site through a home gateway interface, and ultimately connected to the Internet to share information.
2. Intelligence: voice interaction devices can respond automatically according to changes in the surrounding environment, without human intervention.
3. Openness and compatibility: since a user's voice interaction devices may come from different manufacturers, they need to be open and compatible.
4. Energy saving: smart appliances can automatically adjust their working time and working state according to the surrounding environment, thereby saving energy.
5. Ease of use: since complicated control operations are handled by the controller embedded in the voice interaction device, the user only needs to understand very simple operations. A voice interaction device does not refer to a single device but to a technical system; as human application demands and device intelligence continue to develop, its content will become richer, and the functions of voice interaction devices will differ according to the actual application environment, but they generally feature intelligent control technology.
It should be understood that the prosody hierarchy structure output by the speech synthesis system can specifically be the prosody hierarchy of Chinese. As a tonal language, Chinese has extremely complex prosodic features. The prosody hierarchy structure models prosodic features of speech such as pauses and rhythm, and is of great significance to the naturalness of the sound synthesized by a speech synthesis system. A typical division of the prosody hierarchy structure is shown in Fig. 2. Referring to Fig. 2, which is a structural schematic diagram of the prosody hierarchy in the embodiment of the present application, from the bottom up it is divided into prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH). For example, in the sentence "showing best regards and hearty blessing", the PWs are "showing", "sincere", "greeting", "and hearty", "blessing"; the PPHs are "showing", "best regards", "and hearty blessing"; and the IPHs are "showing best regards" and "and hearty blessing".
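The three-level division in this example can be sketched as nested groupings of the same word sequence. The following is a minimal illustrative sketch (not part of the patent); the English glosses stand in for the original Chinese words:

```python
# A sketch of the three-level prosody hierarchy for the example sentence;
# English glosses stand in for the Chinese words.
sentence = ["showing", "sincere", "greeting", "and hearty", "blessing"]

# Each level is a coarser grouping of the level below: PW -> PPH -> IPH.
hierarchy = {
    "PW":  [["showing"], ["sincere"], ["greeting"], ["and hearty"], ["blessing"]],
    "PPH": [["showing"], ["sincere", "greeting"], ["and hearty", "blessing"]],
    "IPH": [["showing", "sincere", "greeting"], ["and hearty", "blessing"]],
}

def flatten(groups):
    """Concatenate a level's groups back into the underlying word sequence."""
    return [w for g in groups for w in g]

# Every level is a segmentation of the same word sequence.
for groups in hierarchy.values():
    assert flatten(groups) == sentence
```

This makes explicit that labeling the prosody hierarchy amounts to deciding, at each word boundary, which level of boundary (if any) occurs there.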
With reference to the above introduction, the method of prosody hierarchy labeling in this application is introduced below. Referring to Fig. 3, one embodiment of the method of prosody hierarchy labeling in the embodiment of the present application includes:
101. Obtain text data to be labeled and audio data, where there is a correspondence between the text data to be labeled and the audio data, the text data to be labeled includes at least one word, and each word corresponds to a term identification;
In the present embodiment, the text data to be labeled and the corresponding audio data are obtained first. The text data to be labeled can specifically be a sentence or a paragraph, and its language form includes but is not limited to Chinese, Japanese, English or Korean. The audio data can specifically be an audio file. The text data to be labeled includes at least one word and can therefore be segmented; for example, "showing best regards and hearty blessing" can be divided into the following five words: "showing", "sincere", "greeting", "and hearty", "blessing", and different words correspond to different term identifications.
102. Extract the text feature set to be labeled of each word according to the text data to be labeled, where the text feature set to be labeled includes the part of speech, the word length and the punctuation type after the word;
In the present embodiment, feature extraction is then performed on each word. Feature extraction covers two aspects: the first is the extraction of text features and the second is the extraction of acoustic features. When extracting text features, the text features of each word in the text data to be labeled need to be extracted. Taking the text data to be labeled "showing best regards and hearty blessing" as an example, the text feature set to be labeled corresponding to each word can be extracted; the text feature set here includes but is not limited to the part of speech, the word length and the punctuation type after the word.
Parts of speech are generally divided into notional words and function words. A notional word is a word with substantive meaning that can serve as a sentence element on its own, that is, a word with both lexical and grammatical meaning; taking grammatical function as the main criterion, a word that can independently serve as a syntactic constituent, with lexical and grammatical meaning, is a notional word. Notional words include nouns, verbs, adjectives, numerals, quantifiers and pronouns. A function word is a word without complete substantive meaning but with grammatical meaning or function; it must attach to a notional word or sentence to express grammatical meaning, cannot form a sentence or serve as a grammatical item on its own, and cannot be reduplicated. Function words include adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia.
The word length indicates the length of the word; for example, the word length of "greeting" is 2 and the word length of "and" is 1.
The punctuation type after the word indicates whether a punctuation mark immediately follows the word and, if so, the type of that punctuation mark. In spoken language, punctuation manifests as pause time; it also helps people express thoughts and feelings accurately and understand written language.
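The per-word text features described above can be sketched as follows. This is a minimal illustrative sketch; the pre-tagged input is an assumption — in practice a word segmenter and part-of-speech tagger would produce it:

```python
# A sketch of per-word text-feature extraction: part of speech, word
# length, and the punctuation type following the word. The pre-tagged
# input is an illustrative assumption; in practice a segmenter / POS
# tagger supplies it.
PUNCT_TYPES = {"。": "full_stop", "，": "comma", "、": "pause_mark", None: "none"}

def text_features(tagged_words):
    """tagged_words: list of (word, pos, following_punct_or_None)."""
    feats = []
    for word, pos, punct in tagged_words:
        feats.append({
            "pos": pos,                    # notional or function word class
            "word_length": len(word),      # number of characters in the word
            "punct_after": PUNCT_TYPES.get(punct, "other"),
        })
    return feats

# "问候" ("greeting") has word length 2, "和" ("and") has word length 1,
# matching the examples in the text above.
tagged = [("问候", "noun", "，"), ("和", "conjunction", None)]
print(text_features(tagged))
```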
103. Extract the acoustic feature set of each word according to the audio data, where the acoustic feature set includes the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable, and the inter-word acoustic feature change values;
In the present embodiment, when extracting acoustic features, the acoustic features of each word in the audio data need to be extracted. Taking the text data to be labeled "showing best regards and hearty blessing" as an example, five acoustic feature sets can be extracted; the acoustic feature set here includes but is not limited to the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable, and the inter-word acoustic feature change values.
The final syllable duration refers to the time span of the voiced part of the last syllable of the word. For example, the last character of "greeting" is pronounced "hou", where the unvoiced part is "h" and the voiced part is "ou"; the final syllable duration is the time span of the "ou" part. It is detected with a dedicated tool, which is not expanded upon here.
The pause duration after the word refers to the length of time between the end of this word and the start of the next word, such as the interval between "greeting" and "and".
The acoustic statistics of the final syllable generally include ten parameters. Five of them are parameters related to the fundamental frequency curve of the final syllable: the maximum, minimum, range, mean and variance of the log fundamental frequency curve. The other five are parameters related to the log energy curve of the final syllable: the minimum, maximum, range, mean and variance of the log energy curve.
The inter-word acoustic feature change values are the log fundamental frequency difference and the log energy difference between the voiced tail of one word and the first voiced part of the next word.
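The five log-F0 statistics and the inter-word change value described above can be sketched directly. This is an illustrative sketch with made-up per-frame values; real F0 values would come from a pitch tracker:

```python
import math

def log_f0_stats(f0_frames):
    """Five statistics over the log-F0 curve of a word's final voiced
    syllable: max, min, range, mean, variance. f0_frames are per-frame
    fundamental-frequency values in Hz (voiced frames only); the same
    five statistics are computed over the log-energy curve."""
    logs = [math.log(f) for f in f0_frames]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return {"max": max(logs), "min": min(logs),
            "range": max(logs) - min(logs), "mean": mean, "variance": var}

def inter_word_change(last_voiced_f0, next_first_voiced_f0):
    """Log-F0 difference between the last voiced frame of the current
    word and the first voiced frame of the next word (the energy change
    value is computed the same way on log energy)."""
    return math.log(next_first_voiced_f0) - math.log(last_voiced_f0)

print(log_f0_stats([100.0, 200.0]))
print(inter_word_change(100.0, 200.0))
```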
104. Obtain the prosody hierarchy structure through the prosody hierarchy labeling model according to the term identification of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
In the present embodiment, the text data to be labeled and the audio data are input into the prosody hierarchy labeling model, and the model outputs the corresponding prosody hierarchy structure according to the term identification of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
For ease of introduction, referring to Fig. 4, which is an application demonstration diagram of the prosody hierarchy labeling system in the embodiment of the present application, the user provides the text data and audio data whose prosody hierarchy structure needs to be labeled; for example, the user inputs the text data to be labeled "showing best regards and hearty blessing" and supplies it, together with the corresponding audio data, to the prosody hierarchy labeling model. The model extracts the features, that is, the text feature set to be labeled of each word and the acoustic feature set of each word, then obtains the prosody hierarchy structure through a forward pass of the deep neural network, and provides the text labeled with the prosody hierarchy structure to the user.
Referring to Fig. 5, which is a flow diagram of prosody hierarchy labeling in the embodiment of the present application, specifically, in step S1, the text data and audio data of the sentence to be labeled are obtained first. In step S2, word segmentation is performed on the text data, and the text data is force-aligned with the audio data. In step S3, after the text data and audio data have been force-aligned, the corresponding text features and acoustic features can be extracted. In step S4, the extracted text features and acoustic features are input into the prosody hierarchy labeling model, which includes a feedforward neural network and a bidirectional long short-term memory network. In step S5, the prosody hierarchy labeling model outputs the prosody hierarchy structure of the sentence.
In the embodiment of the present application, a method of prosody hierarchy labeling is provided. First, text data to be labeled and audio data are obtained, where there is a correspondence between them, the text data to be labeled includes at least one word, and each word corresponds to a term identification. The text feature set to be labeled of each word is then extracted according to the text data to be labeled, where it includes the part of speech, the word length and the punctuation type after the word. The acoustic feature set of each word is extracted according to the audio data, where it includes the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable and the inter-word acoustic feature change values. Finally, the prosody hierarchy structure is obtained through the prosody hierarchy labeling model according to the term identification, the text feature set to be labeled and the acoustic feature set of each word. In this way, the prosody hierarchy labeling model is established jointly from text features and acoustic features, which provides richer features for prosody hierarchy labeling; a more accurate prosody hierarchy labeling model improves the accuracy of prosody hierarchy labeling and the effect of speech synthesis.
Optionally, on the basis of the embodiment corresponding to Fig. 3 above, in a first alternative embodiment of the method of prosody hierarchy labeling provided by the embodiment of the present application, obtaining the prosody hierarchy structure through the prosody hierarchy labeling model may include:
determining at least one of a prosodic word, a prosodic phrase and an intonational phrase through the prosody hierarchy labeling model;
or, determining a prosodic word and/or a prosodic phrase through the prosody hierarchy labeling model.
In the present embodiment, two common prosody hierarchy structures are introduced. In the first case, at least one of a prosodic word, a prosodic phrase and an intonational phrase is determined through the prosody hierarchy labeling model; that is, the model is trained on four classes: no prosody hierarchy boundary, prosodic word boundary, prosodic phrase boundary and intonational phrase boundary. In the second case, a prosodic word and/or a prosodic phrase is determined through the prosody hierarchy labeling model; that is, the model is trained on three classes: no prosody hierarchy boundary, prosodic word boundary and prosodic phrase boundary.
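The two schemes correspond to a four-class and a three-class boundary tag set. A minimal sketch (tag names, and the collapsing rule, are illustrative assumptions, not specified by the text):

```python
# The four-class scheme (with intonational phrase boundaries) and the
# three-class scheme (without). Tag names are illustrative.
FOUR_CLASS = {0: "no_boundary", 1: "PW_boundary",
              2: "PPH_boundary", 3: "IPH_boundary"}
THREE_CLASS = {0: "no_boundary", 1: "PW_boundary", 2: "PPH_boundary"}

def to_three_class(label):
    """Collapse an IPH boundary onto a PPH boundary when only the
    two-level labeling scheme is wanted. This is one reasonable mapping
    (an assumption), since every IPH boundary is also a phrase break."""
    return min(label, 2)

assert to_three_class(3) == 2   # IPH boundary becomes a PPH boundary
assert to_three_class(1) == 1   # PW boundary is unchanged
```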
When labeling prosody hierarchy, this application uses the prosody hierarchy labeling model generated in the training stage to label the input text data after text processing, thereby obtaining text labeled with prosody hierarchy structure and quickly building the corpus needed by a speech synthesis system.
Secondly, the embodiment of the present application describes two common prosody hierarchy labeling methods: one determines prosodic words, prosodic phrases and intonational phrases through the prosody hierarchy labeling model, and the other determines prosodic words and prosodic phrases. In this way, the user can choose the finer labeling scheme with the three-level prosody hierarchy structure of prosodic word, prosodic phrase and intonational phrase, or the two-level scheme with prosodic word and prosodic phrase. The scheme can therefore select the output prosody hierarchy on demand, improving its flexibility.
With reference to the above introduction, the method of model training in this application is introduced below. Referring to Fig. 6, one embodiment of the method of model training in the embodiment of the present application includes:
201. Obtain text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a term identification;
In the present embodiment, the text data to be trained and the corresponding audio data to be trained are obtained first. The text data to be trained can specifically be a sentence or a paragraph, and its language form includes but is not limited to Chinese, Japanese, English or Korean. The audio data to be trained can specifically be an audio file. The text data to be trained includes at least one word and can therefore be segmented; for example, "showing best regards and hearty blessing" can be divided into the following five words: "showing", "sincere", "greeting", "and hearty", "blessing", and different words correspond to different term identifications.
It can be understood that training generally requires a large number of samples; here, the text data to be trained and the audio data to be trained are samples. For ease of introduction, the text data to be trained and the audio data to be trained are described as one sample, which should not be construed as a limitation of this application.
202. Extract the text feature set to be trained of each word according to the text data to be trained, where the text feature set to be trained includes the part of speech, the word length and the punctuation type after the word;
In the present embodiment, feature extraction is then performed on each word. Feature extraction covers two aspects: the first is the extraction of text features and the second is the extraction of acoustic features. When extracting text features, the text features of each word in the text data to be trained need to be extracted. Taking the text data to be trained "showing best regards and hearty blessing" as an example, the text feature set to be trained corresponding to each word can be extracted; it includes but is not limited to the part of speech, the word length and the punctuation type after the word.
It should be noted that the part of speech, the word length and the punctuation type after the word are introduced in the above embodiment and are not repeated here.
203. Extract the acoustic feature set to be trained of each word according to the audio data to be trained, where the acoustic feature set to be trained includes the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable and the inter-word acoustic feature change values;
In the present embodiment, when extracting acoustic features, the acoustic features of each word in the audio data to be trained need to be extracted. Taking the text data to be trained "showing best regards and hearty blessing" as an example, the acoustic feature set to be trained of each word can be extracted; it includes but is not limited to the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable and the inter-word acoustic feature change values.
It should be noted that the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable and the inter-word acoustic feature change values are introduced in the above embodiment and are not repeated here.
204. Train on the term identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain the prosody hierarchy labeling model, where the prosody hierarchy labeling model is used to label prosody hierarchy structure.
In the present embodiment, the training process of the prosody hierarchy labeling model is introduced. The training data is text data labeled with prosody hierarchy structure and the corresponding audio data, and the sequence is modeled with a deep neural network. Each sentence contains multiple words, and a sentence is a word sequence; the features and label of each word serve as the input and output of one time step of the deep neural network. Each word has a corresponding label y, so the labels of a sentence can be expressed as a vector Y. From the text data and the corresponding audio data, the term identification, text features and acoustic features of each word in the sentence can be extracted to form the feature x of that word, and the multiple words of the sentence can be expressed as the input vector X. The loss function is expressed as L(Y, f(X)); training on a large number of samples to make the loss function as small as possible yields the trained parameters of the neural network, and thereby the model for automatic labeling of prosody hierarchy structure, namely the prosody hierarchy labeling model.
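The per-sentence setup above (one word per time step, feature vector X, label vector Y, loss L(Y, f(X))) can be sketched with a cross-entropy loss. The model f is stubbed here as a uniform classifier; in the text it is the deep neural network over word id, text and acoustic features:

```python
import math

def cross_entropy(Y, probs):
    """Sentence-level loss L(Y, f(X)): sum of per-word negative
    log-likelihoods, one time step per word."""
    return -sum(math.log(p[y]) for y, p in zip(Y, probs))

def f(X, n_classes=4):
    """Stub model: uniform class probabilities for every word. Training
    would adjust parameters so that cross_entropy(Y, f(X)) shrinks."""
    return [[1.0 / n_classes] * n_classes for _ in X]

X = [{"word_id": 0}, {"word_id": 1}, {"word_id": 2}]  # three-word sentence
Y = [0, 1, 2]                                          # one boundary label per word
loss = cross_entropy(Y, f(X))
print(loss)  # for the uniform stub this is 3 * ln(4)
```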
In the embodiment of the present application, a method of model training is provided. First, text data to be trained and audio data to be trained are obtained, where there is a correspondence between them and each word corresponds to a term identification. The text feature set to be trained of each word is then extracted according to the text data to be trained, where it includes the part of speech, the word length and the punctuation type after the word, and the acoustic feature set to be trained of each word is extracted according to the audio data to be trained, where it includes the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable and the inter-word acoustic feature change values. Finally, the term identification, the text feature set to be trained and the acoustic feature set to be trained of each word are trained on to obtain the prosody hierarchy labeling model. In this way, the prosody hierarchy labeling model is established jointly from text features and acoustic features, which provides richer features for prosody hierarchy labeling; a more accurate prosody hierarchy labeling model improves the accuracy of prosody hierarchy labeling and the effect of speech synthesis.
Optionally, on the basis of the embodiment corresponding to Fig. 6 above, in a first alternative embodiment of the method of model training provided by the embodiment of the present application, after obtaining the text data to be trained and the audio data to be trained, the method may also include:
performing word segmentation on the text data to be trained to obtain at least one word;
obtaining the target term identification corresponding to the target word according to a preset term identification relationship, where the preset term identification relationship is used to indicate the relationship between each preset word and its term identification, and the target word is any one of the at least one word;
generating the target term vector corresponding to the target word in the text data to be trained;
and training on the term identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain the prosody hierarchy labeling model may include:
training on the target term identification and the target term vector to obtain a first model parameter, where the first model parameter is used to generate the word embedding layer in the prosody hierarchy labeling model.
In the present embodiment, a method of training the word embedding layer in the prosody hierarchy labeling model is provided. First, the text data to be trained needs to be obtained and then segmented; for example, if the text data to be trained is "setting up cooperatives forms a new model through e-commerce platforms", the segmentation yields "set up", "cooperative society", "by", "e-commerce", "platform", "form", "new model". Next, the term identification corresponding to each word needs to be determined according to the preset term identification relationship. For ease of understanding, please refer to Table 1, which is an illustration of the preset term identification relationship.
Table 1
Term identification | Word |
0 | set up |
1 | cooperative society |
2 | by |
3 | e-commerce |
4 | platform |
5 | form |
6 | new model |
As can be seen from Table 1, the preset term identification relationship is used to indicate the relationship between words and term identifications, and identical words correspond to the same term identification. Assuming the target word is "set up", the term identification of the word is "0", which then serves as the input to the word embedding layer.
Following the above method of generating the target term identification and target term vector, the other term identifications and term vectors are generated. According to the mapping relationship between term identification and term vector, these term identifications and term vectors are trained, and the first model parameter is obtained by minimizing the loss function; the first model parameter is used to generate the word embedding layer in the prosody hierarchy labeling model. In practical applications, the word embedding layer can be updated periodically to improve its accuracy.
Secondly, the embodiment of the present application describes a method of training the word embedding layer: word segmentation is first performed on the text data to be trained, the target term identification corresponding to the target word is obtained according to the preset term identification relationship, the target term vector corresponding to the target word in the text data to be trained is generated, and the target term identification and target term vector are then trained to obtain the first model parameter, where the first model parameter is used to generate the word embedding layer in the prosody hierarchy labeling model. In this way, the word embedding layer in the prosody hierarchy labeling model can be trained directly, and the other neural networks in the prosody hierarchy labeling model can be trained at the same time as the word embedding layer, which saves the process of training a separate term vector model with an additional independent neural network and improves training efficiency.
Optionally, on the basis of the embodiment corresponding to Fig. 6 above, in a second alternative embodiment of the method of model training provided by the embodiment of the present application, extracting the text feature set to be trained of each word according to the text data to be trained may include:
obtaining the part of speech, the word length and the punctuation type after the word of the target word in the text data to be trained, where the part of speech indicates the syntactic category of the word, the word length indicates the number of characters in the word, and the punctuation type after the word is used to indicate the type of punctuation that follows the word;
obtaining the part of speech, the word length and the punctuation type after the word of the related words in the text data to be trained, where a related word is a word that has an association relationship with the target word;
and training on the term identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain the prosody hierarchy labeling model may include:
training on the part of speech, the word length and the punctuation type after the word of the target word and of the related words to obtain a second model parameter, where the second model parameter is used to generate the text neural network in the prosody hierarchy labeling model.
In the present embodiment, a method of training the text neural network of the prosody hierarchy labeling model is provided. For ease of understanding, the target word in the text data to be trained continues to serve as the example; it can be understood that the other words in the text data to be trained are processed similarly to the target word, which is not repeated here.
Specifically, word segmentation is first performed on the text data to be trained; for example, if the text data to be trained is "setting up cooperatives, through e-commerce platforms, forms a new model", the segmentation yields "set up", "cooperative society", "by", "e-commerce", "platform", "form", "new model". Assuming the target word is "cooperative society", the part of speech of the target word is noun, the word length is 3, and the punctuation type after the word is comma. For ease of understanding, the relationship between part of speech and identifier and the relationship between post-word punctuation type and identifier are illustrated below with Tables 2 and 3. In practical applications, text features are usually represented numerically, so the textual concepts need to be converted into numerical concepts.
Table 2
| POS label | Part of speech | Examples |
| --- | --- | --- |
| 0 | Noun | Shanghai, cucumber, Chinese cabbage, tractor, quality, virtue |
| 1 | Verb | come, walk, run, pay attention, learn, take off, agree, recognize |
| 2 | Adjective | many, thin, tall, ugly, snow-white, beautiful, red |
| 3 | Adverb | very, quite, extremely, just, all, at once, immediately, once |
| 4 | Pronoun | I, you, he, she, it, we |
| 5 | Preposition | from, to, toward, for, in order to, at, than, by |
| 6 | Quantifier | Chinese measure words such as 个, 只, 支, 种 |
| 7 | Conjunction | so, therefore, and, or |
| 8 | Auxiliary word | Chinese particles such as 的, 了, 所, 呢 |
| 9 | Numeral | one, two, three, seven, ten, hundred, thousand, ten thousand, hundred million |
| 10 | Interjection | hey, ah, hi, oh |
| 11 | Onomatopoeia | toot, rumble, chuckle, rustle |
Table 3
| Post-word punctuation label | Post-word punctuation type | Example |
| --- | --- | --- |
| 0 | Full stop | 。 |
| 1 | Question mark | ？ |
| 2 | Exclamation mark | ！ |
| 3 | Pause mark | 、 |
| 4 | Comma | ， |
| 5 | Semicolon | ； |
| 6 | Colon | ： |
| 7 | No punctuation | |
As can be seen from Table 2 and Table 3, when the target word is "cooperative", the corresponding features are "noun, 3, comma", which can be represented as "0 3 4". To enrich the text features, the words surrounding the target word also need to be considered, i.e., related words are obtained. A related word may be the previous word, the next word, the previous two words, and so on, which is not limited here.
Assuming the related words are the previous word and the next word of the target word, and the target word is "cooperative", the related words are "establish" and "through". From Table 2 and Table 3, the features of "establish" are "verb, 2, no punctuation".
The number of part-of-speech categories, the maximum word length, and the number of punctuation categories in the corpus are counted, so the part-of-speech feature, word-length feature, and post-word punctuation feature can each be represented as a one-hot vector. Concatenating the three one-hot vectors yields the text feature of the current target word, and concatenating the text feature of the target word with the text features of the related words yields the text feature vector of the target word, i.e., the to-be-trained text feature set.
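The one-hot encoding and concatenation described above can be sketched as follows. This is an illustrative sketch only: the 12 part-of-speech labels and 8 punctuation labels follow Tables 2 and 3, while the maximum word length of 10 and the function names are hypothetical.

```python
import numpy as np

def one_hot(index: int, size: int) -> np.ndarray:
    """Return a one-hot row vector of the given size."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def word_text_feature(pos_id, word_len, punct_id,
                      n_pos=12, max_len=10, n_punct=8):
    """Concatenate one-hot encodings of POS, word length and
    post-word punctuation into one text-feature vector."""
    return np.concatenate([
        one_hot(pos_id, n_pos),
        one_hot(word_len - 1, max_len),  # lengths 1..max_len
        one_hot(punct_id, n_punct),
    ])

# "cooperative": noun (0), length 3, comma (4) -> code "0 3 4"
target = word_text_feature(0, 3, 4)
# related word "establish": verb (1), length 2, no punctuation (7)
prev = word_text_feature(1, 2, 7)
# splice the related word's features with the target word's features
feature_vector = np.concatenate([prev, target])
```

In practice the vector would also include the features of the following word, in the same concatenated layout.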
Following the above method of extracting a to-be-trained text feature set, the to-be-trained text feature set of each word is extracted, and these text feature sets are trained; the second model parameter is obtained by minimizing a loss function, and the second model parameter is used to generate the text neural network in the prosody hierarchy annotation model. In practical applications, the text neural network can be updated periodically to improve its accuracy.
It is understood that the text neural network may be a feedforward neural network or a convolutional neural network, or a neural network of another type; the bidirectional long short-term memory network may likewise be replaced by one of its variants, such as a recurrent neural network with gated recurrent units. These are merely illustrations and should not be construed as limiting the application. The application likewise does not limit the number of layers or the number of neurons of the text neural network.
Secondly, this embodiment of the present application describes a method for training the text neural network: first obtain the part of speech, word length, and post-word punctuation type of the target word in the to-be-trained text data, and obtain the part of speech, word length, and post-word punctuation type of the related word; then train on these features of the target word and the related word to obtain the second model parameter, which is used to generate the text neural network in the prosody hierarchy annotation model. In this way, the system can, through the neural network, automatically learn high-level feature representations beneficial to prosody hierarchy structure annotation from the originally input text feature set, thereby improving the performance of automatic prosody hierarchy structure annotation.
Optionally, on the basis of the embodiment corresponding to Fig. 6 above, in a third alternative embodiment of the model training method provided by the embodiment of the present application, after obtaining the to-be-trained text data and the to-be-trained audio data, the method may further include:
performing forced alignment on the to-be-trained text data and the to-be-trained audio data to obtain a time-aligned text.
Extracting the to-be-trained acoustic feature set of each word according to the to-be-trained audio data may include:
determining the final-syllable duration of the target word according to the time-aligned text.
In this embodiment, how to extract the acoustic feature set of a word is introduced: forced alignment is performed on the to-be-trained text data and the to-be-trained audio data to obtain a time-aligned text. Specifically, frame boundaries at the phoneme level are obtained, from which the frame boundaries of the final of the word's last syllable follow; from the starting frame number and ending frame number of that final, the final-syllable duration of the target word is calculated.
For ease of introduction, please refer to Fig. 7, which is a schematic flowchart of extracting an acoustic feature set in an embodiment of the present application. As shown in the figure, in step A1, text data and audio data are first obtained, which may specifically be the to-be-trained text data and the to-be-trained audio data. In step A2, word segmentation is performed on the to-be-trained text data, and a forced-alignment tool is applied to the text data and the audio data to obtain the time-aligned text, i.e., the phoneme-level boundary information. In step A4, the start and end frame numbers corresponding to the boundaries of the final of each word's last syllable are determined; likewise, the frame numbers of the last voiced frame at the end of a word and the first voiced frame at the start of the next word can be determined. In step A3, the log fundamental frequency curve and the log energy curve are extracted frame by frame from the audio data. In step A5, combined with the time-aligned text, the log fundamental frequency curve and log energy curve of the final syllable are obtained, as well as the log fundamental frequency values and log energy values of the word-final voiced frame and the first voiced frame of the next word. In step A6, the log fundamental frequency statistical features and log energy statistical features of the final syllable are calculated, together with the log fundamental frequency difference and log energy difference between the word-final voiced frame and the first voiced frame of the next word. In step A7, these acoustic features are concatenated to form the word-level acoustic feature set used for the automatic prosody hierarchy structure annotation task.
Specifically, after forced alignment of the to-be-trained text data and the to-be-trained audio data, the phoneme-level boundary information is obtained. Suppose the to-be-trained text data is "extending sincere greetings and heartfelt blessings" and the target word is "greetings". The final-syllable duration is the time span of the last syllable of the word; with the forced-alignment information, the frame boundaries of the final of the last syllable can be calculated. For example, the last syllable of the target word "greetings" is pronounced "hou", whose final is "ou". If the starting frame number of "ou" in the audio is 101 and the ending frame number is 120, then "ou" lasts 20 frames; at 5 milliseconds per frame, the duration of "ou" is 100 milliseconds, i.e., the final-syllable duration of "greetings" is 100 milliseconds.
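The duration calculation in this example can be sketched as follows; the function name and the default 5 ms frame shift are illustrative, and the frame range is treated as inclusive so that frames 101 through 120 count as 20 frames.

```python
def frames_to_ms(start_frame: int, end_frame: int, frame_ms: float = 5.0) -> float:
    """Duration in milliseconds of an inclusive range of frames."""
    return (end_frame - start_frame + 1) * frame_ms

# The final "ou" of the word's last syllable spans frames 101..120.
duration = frames_to_ms(101, 120)  # 20 frames * 5 ms = 100.0 ms
```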
Secondly, in this embodiment of the present application, after the to-be-trained text data and the to-be-trained audio data are obtained, forced alignment is performed on them to obtain the time-aligned text, and the final-syllable duration of the target word is then determined according to the time-aligned text. In this way, the time-aligned text can be obtained and the final-syllable duration extracted as one element of the acoustic feature set; high-level features beneficial to annotation are learned automatically from the originally input acoustic feature set, thereby improving the accuracy of the prosody hierarchy annotation model.
Optionally, on the basis of the third embodiment corresponding to Fig. 6 above, in a fourth alternative embodiment of the model training method provided by the embodiment of the present application, extracting the to-be-trained acoustic feature set of each word according to the to-be-trained audio data may include:
determining the post-word pause duration of the target word according to the time-aligned text.
In this embodiment, how to obtain the post-word pause duration of a word is introduced. Specifically, after forced alignment of the to-be-trained text data and the to-be-trained audio data, M speech frames are obtained. Suppose the to-be-trained text data is "extending sincere greetings and heartfelt blessings" and the target word is "greetings"; the next adjacent word is "and". The short pause between "greetings" and "and" can then be calculated from the time-aligned text data. If the pause lasts 20 frames at 5 milliseconds per frame, the post-word pause duration of the target word is 100 milliseconds.
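Under the same frame-based convention, the post-word pause might be computed as below. The assumption that the pause consists of the frames strictly between the word's last frame and the next word's first frame is a convention of this sketch, not stated in the source; the function name is also hypothetical.

```python
def pause_ms(word_end_frame: int, next_word_start_frame: int,
             frame_ms: float = 5.0) -> float:
    """Pause duration between a word and the next word, counting the
    frames strictly between the word's last frame and the next word's
    first frame (an assumed convention)."""
    pause_frames = next_word_start_frame - word_end_frame - 1
    return pause_frames * frame_ms

# "greetings" ends at frame 120, "and" starts at frame 141:
# frames 121..140 form a 20-frame pause -> 100 ms at 5 ms per frame.
pause = pause_ms(120, 141)
```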
Again, in this embodiment of the present application, after the to-be-trained text data and the to-be-trained audio data are obtained, forced alignment is performed on them to obtain the time-aligned text, and the post-word pause duration can then be determined according to the time-aligned text. In this way, the post-word pause duration of each word can be determined after aligning the text data and audio data, and used as one element of the acoustic feature set; high-level features beneficial to annotation are learned automatically from the originally input acoustic feature set, thereby improving the accuracy of the prosody hierarchy annotation model.
Optionally, on the basis of the third embodiment corresponding to Fig. 6 above, in a fifth alternative embodiment of the model training method provided by the embodiment of the present application, extracting the to-be-trained acoustic feature set of each word according to the to-be-trained audio data may include:
calculating the frame number of the voiced start frame and the frame number of the voiced end frame of the final syllable of the target word according to the time-aligned text and the fundamental frequency information extracted from the to-be-trained audio data;
extracting the log fundamental frequency curve and the log energy curve of the to-be-trained audio data;
calculating the final-syllable acoustic statistical features of the target word according to the frame number of the voiced start frame, the frame number of the voiced end frame, the log fundamental frequency curve, and the log energy curve, where the final-syllable acoustic statistical features include at least one of the maximum, minimum, range, mean, and variance of the log fundamental frequency curve, and further include at least one of the maximum, minimum, range, mean, and variance of the log energy curve.
In this embodiment, how to obtain the final-syllable acoustic statistical features of a word is introduced. Specifically, after forced alignment of the to-be-trained text data and the to-be-trained audio data, the time-aligned text is obtained. Suppose the to-be-trained text data is "extending sincere greetings and heartfelt blessings". The fundamental frequency and energy of the corresponding audio are extracted frame by frame, generating a fundamental frequency curve and an energy curve. For ease of understanding, please refer to Fig. 8 and Fig. 9: Fig. 8 is a schematic diagram of an embodiment of the fundamental frequency curve in an embodiment of the present application, and Fig. 9 is a schematic diagram of an embodiment of the energy curve. To normalize the data, the logarithm of each curve is taken, yielding the log fundamental frequency curve and the log energy curve; near a prosody hierarchy structure boundary, both fundamental frequency and energy weaken. Assuming the target word is "greetings", the segments of the log fundamental frequency curve and log energy curve corresponding to the word's final syllable are intercepted from those of the audio according to the frame numbers of the voiced start frame and voiced end frame of "greetings". From these intercepted segments, the final-syllable acoustic statistical features of the following ten dimensions are calculated: the maximum, minimum, range, mean, and variance of the log fundamental frequency curve, and the maximum, minimum, range, mean, and variance of the log energy curve.
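The ten-dimensional statistics can be sketched with NumPy as follows; the function names are hypothetical, and the voiced segment is taken as an inclusive frame range.

```python
import numpy as np

def curve_stats(curve: np.ndarray) -> list:
    """Max, min, range, mean and variance of one log-domain curve."""
    return [curve.max(), curve.min(), curve.max() - curve.min(),
            curve.mean(), curve.var()]

def final_syllable_features(log_f0: np.ndarray, log_energy: np.ndarray,
                            start: int, end: int) -> list:
    """Ten-dimensional statistics over the voiced segment [start, end]
    of the final syllable: five for log F0, five for log energy."""
    seg_f0 = log_f0[start:end + 1]
    seg_en = log_energy[start:end + 1]
    return curve_stats(seg_f0) + curve_stats(seg_en)
```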
Again, in this embodiment of the present application, after the to-be-trained text data and the to-be-trained audio data are obtained, forced alignment is performed on them to obtain the time-aligned text. According to the time-aligned text and the fundamental frequency information extracted from the to-be-trained audio data, the frame numbers of the voiced start frame and voiced end frame of the final syllable of the target word are calculated; the log fundamental frequency curve and log energy curve of the to-be-trained audio data are extracted; and finally the final-syllable acoustic statistical features of the target word are calculated from these frame numbers and curves. In this way, the time-aligned text data is obtained, the start and end frame numbers of the word-final voiced segment are then obtained from the fundamental frequency information extracted from the audio, and high-level features beneficial to annotation are learned automatically from the originally input acoustic feature set, thereby improving the accuracy of the prosody hierarchy annotation model.
Optionally, on the basis of the third embodiment corresponding to Fig. 6 above, in a seventh alternative embodiment of the model training method provided by the embodiment of the present application, extracting the to-be-trained acoustic feature set of each word according to the to-be-trained audio data may include:
calculating the frame number of the last voiced frame of the target word and the frame number of the first voiced frame of the next adjacent word according to the time-aligned text and the fundamental frequency information extracted from the to-be-trained audio data;
determining the fundamental frequency values and energy values of the word-final voiced frame of the target word and of the first voiced frame of the next adjacent word, according to these two frame numbers and the fundamental frequency and energy information extracted frame by frame from the to-be-trained audio data;
calculating the log difference of the fundamental frequency values and the log difference of the energy values between the word-final voiced frame of the target word and the first voiced frame of the next adjacent word, where the log difference of fundamental frequency values and the log difference of energy values belong to the inter-word acoustic feature change values.
In this embodiment, how to obtain the inter-word acoustic feature change values is introduced. Specifically, after forced alignment of the to-be-trained text data and the to-be-trained audio data, the time-aligned text is obtained. Suppose the to-be-trained text data is "extending sincere greetings and heartfelt blessings". The fundamental frequency and energy information of the corresponding to-be-trained audio data is extracted frame by frame, generating a fundamental frequency curve and an energy curve; to normalize the data, the logarithm of each curve is taken, yielding the log fundamental frequency curve and the log energy curve. Assuming the target word is "greetings", then according to the time-aligned text and the frame-by-frame fundamental frequency information, the last voiced frame of "greetings" and the first voiced frame of the next word "and" can be determined. The fundamental frequency values and energy values of these two frames are then obtained, and the log fundamental frequency difference and log energy difference between the two frames are calculated.
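A minimal sketch of the inter-word log differences, assuming the raw fundamental frequency and energy values of the two boundary frames are already available (the function name is hypothetical):

```python
import math

def boundary_log_diffs(f0_prev: float, energy_prev: float,
                       f0_next: float, energy_next: float):
    """Log F0 and log energy differences across a word boundary:
    last voiced frame of the word vs. first voiced frame of the
    next word. Larger drops suggest a stronger prosodic boundary."""
    d_f0 = math.log(f0_next) - math.log(f0_prev)
    d_energy = math.log(energy_next) - math.log(energy_prev)
    return d_f0, d_energy

# F0 halves across the boundary, energy stays flat.
d_f0, d_energy = boundary_log_diffs(200.0, 4.0, 100.0, 4.0)
```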
Again, in this embodiment of the present application, after the to-be-trained text data and the to-be-trained audio data are obtained, forced alignment is performed on them to obtain the time-aligned text. According to the frame numbers of the last voiced frame of the target word and the first voiced frame of the next adjacent word, together with the fundamental frequency and energy data extracted frame by frame from the audio, the fundamental frequency values and energy values of the word-final voiced frame of the target word and of the first voiced frame of the next word are determined; the log difference of the two fundamental frequency values and the log difference of the two energy values are then calculated and used as the inter-word acoustic feature change values. In this way, high-level features beneficial to annotation can be learned automatically from the originally input acoustic feature set, thereby improving the accuracy of the prosody hierarchy annotation model.
Optionally, on the basis of Fig. 6 above and any one of the first to seventh embodiments corresponding to Fig. 6, in an eighth alternative embodiment of the model training method provided by the embodiment of the present application, training the term identifier of each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain the prosody hierarchy annotation model may include:
obtaining a first output result of the target term identifier through the word embedding layer in the prosody hierarchy annotation model, where the target term identifier corresponds to the target word, the target word is any one of the at least one word, and the word embedding layer is obtained by training according to the first model parameter;
obtaining a second output result of the target to-be-trained text feature set through the text neural network in the prosody hierarchy annotation model, where the target to-be-trained text feature set corresponds to the target word, and the text neural network is obtained by training according to the second model parameter;
training on the first output result, the second output result, and the target to-be-trained acoustic feature set to obtain a third model parameter, where the target to-be-trained acoustic feature set corresponds to the target word, and the third model parameter is used to generate the acoustic neural network in the prosody hierarchy annotation model;
generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter, and the third model parameter.
In this embodiment, the method of training to obtain the prosody hierarchy annotation model is introduced. For ease of understanding, please refer to Fig. 10, which is a schematic structural diagram of the prosody hierarchy annotation model in an embodiment of the present application. As shown, taking the target word as an example: the term identifier is the target term identifier, the text feature set is the target to-be-trained text feature set corresponding to the target word, and the acoustic feature set is the target to-be-trained acoustic feature set corresponding to the target word. The target term identifier serves as the input of the word embedding layer, which outputs the first output result; the first output result is specifically the word vector to which the target term identifier is mapped, which may be 200-dimensional. The target to-be-trained text feature set (part of speech, word length, and post-word punctuation type) serves as the input of the text neural network (e.g., a feedforward neural network), which outputs the second output result. The target to-be-trained acoustic feature set, the first output result, and the second output result together serve as the input of the acoustic neural network (e.g., a bidirectional long short-term memory network), and a softmax layer outputs the posterior probability of each prosody hierarchy structure type for the target word. For example, if the probability of a non-prosody-hierarchy boundary is 0.1, the probability of a prosodic word is 0.1, the probability of a prosodic phrase is 0.2, and the probability of an intonation phrase is 0.6, then the prosody hierarchy structure with the maximum posterior probability is taken as the annotation result, so the annotation result of the target word is intonation phrase. This annotation result is the prediction obtained during training and needs to be compared with the ground truth, i.e., a loss function is minimized to determine the third model parameter of the acoustic neural network. Combining the first model parameter, the second model parameter, and the third model parameter, training yields the prosody hierarchy annotation model. The prosody hierarchy annotation model adopts a stacked structure of a feedforward neural network and a bidirectional long short-term memory network, and can simultaneously annotate three kinds of prosody hierarchy structure: prosodic word, prosodic phrase, and intonation phrase.
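The final softmax-and-argmax step can be illustrated as follows. The logits are contrived so that the posteriors reproduce the 0.1/0.1/0.2/0.6 example above; the label list and function names are illustrative, not from the patent.

```python
import numpy as np

LABELS = ["non-boundary", "prosodic word", "prosodic phrase", "intonation phrase"]

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over class logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Contrived logits whose posteriors match the worked example.
logits = np.log(np.array([0.1, 0.1, 0.2, 0.6]))
posterior = softmax(logits)
# Take the class with the maximum posterior probability.
label = LABELS[int(posterior.argmax())]  # "intonation phrase"
```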
The loss function estimates the degree of inconsistency between the predicted value and the true value of the model; it is a non-negative real-valued function. The loss function used by the application may be the cross entropy, or a weighted cross entropy.
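A weighted cross entropy for a single word might look like the sketch below; the class weights shown are hypothetical, since the patent does not specify weight values.

```python
import math

def weighted_cross_entropy(posterior, true_class, class_weights):
    """Weighted cross-entropy loss for one word: the negative log
    posterior of the true class, scaled by that class's weight."""
    return -class_weights[true_class] * math.log(posterior[true_class])

# Posterior from the worked example; true label: intonation phrase (3).
# Doubling the weight of the intonation-phrase class is hypothetical.
loss = weighted_cross_entropy([0.1, 0.1, 0.2, 0.6], 3, [1.0, 1.0, 1.0, 2.0])
```

With all class weights equal to 1.0 this reduces to the ordinary cross entropy.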
It is understood that the word embedding layer, the feedforward neural network, and the bidirectional long short-term memory network are trained together. The word embedding layer is used to train word vectors, and the feedforward neural network automatically extracts, from the originally input features (part of speech, word length, and post-word punctuation type), high-level feature representations more favorable to the annotation task. At the input of the bidirectional long short-term memory network, these features are concatenated, so that text features and acoustic features are exploited jointly.
The bidirectional long short-term memory network can learn dependencies between contexts, which the annotation task also requires: for example, if the previous word is an intonation phrase boundary, the current word is unlikely to be an intonation phrase boundary as well. Therefore, jointly using a trainable word embedding layer with a stacked structure of a feedforward neural network and a bidirectional long short-term memory network not only exploits text and acoustic feature information as well as contextual features, but can also automatically extract high-level features from the text features, making it well suited to the prosody hierarchy structure annotation task.
Further, this embodiment of the present application describes a method of training to obtain the prosody hierarchy annotation model: three classes of model parameters need to be trained, namely the first model parameter, the second model parameter, and the third model parameter, and the three are trained as a whole, simultaneously generating the prosody hierarchy annotation model. In this way, the neural networks of the three parts are stacked to form a complete prosody hierarchy annotation model, and model training is performed on it as a whole. The trained content includes the relationship between term identifiers and word vectors, between words and text features, and between audio and acoustic features, so that richer features can be obtained, helping to improve the accuracy of sentence annotation.
The prosody hierarchy annotation apparatus in the application is described in detail below. Please refer to Figure 11, which is a schematic diagram of an embodiment of the prosody hierarchy annotation apparatus in an embodiment of the present application. The prosody hierarchy annotation apparatus 30 includes:
an acquisition module 301, configured to obtain to-be-annotated text data and audio data, where there is a correspondence between the to-be-annotated text data and the audio data, the to-be-annotated text data includes at least one word, and each word corresponds to a term identifier;
an extraction module 302, configured to extract the to-be-annotated text feature set of each word according to the to-be-annotated text data obtained by the acquisition module 301, where the to-be-annotated text feature set includes the part of speech, the word length, and the post-word punctuation type;
the extraction module 302 being further configured to extract the acoustic feature set of each word according to the audio data obtained by the acquisition module 301, where the acoustic feature set includes the final-syllable duration, the post-word pause duration, the final-syllable acoustic statistical features, and the inter-word acoustic feature change values;
a prediction module 303, configured to obtain the prosody hierarchy structure through the prosody hierarchy annotation model according to the term identifier of each word, the to-be-annotated text feature set of each word extracted by the extraction module 302, and the acoustic feature set of each word.
In this embodiment, the acquisition module 301 obtains the to-be-annotated text data and audio data, where there is a correspondence between the to-be-annotated text data and the audio data, the to-be-annotated text data includes at least one word, and each word corresponds to a term identifier; the extraction module 302 extracts the to-be-annotated text feature set of each word according to the to-be-annotated text data obtained by the acquisition module 301, where the to-be-annotated text feature set includes the part of speech, the word length, and the post-word punctuation type; the extraction module 302 further extracts the acoustic feature set of each word according to the audio data obtained by the acquisition module 301, where the acoustic feature set includes the final-syllable duration, the post-word pause duration, the final-syllable acoustic statistical features, and the inter-word acoustic feature change values; and the prediction module 303 obtains the prosody hierarchy structure through the prosody hierarchy annotation model according to the term identifier of each word, the to-be-annotated text feature set of each word extracted by the extraction module 302, and the acoustic feature set of each word.
In the embodiment of the present application, a prosody hierarchy annotation apparatus is provided. First, to-be-annotated text data and audio data are obtained, where there is a correspondence between the to-be-annotated text data and the audio data, the to-be-annotated text data includes at least one word, and each word corresponds to a term identifier. Then the to-be-annotated text feature set of each word is extracted according to the to-be-annotated text data, where the to-be-annotated text feature set includes the part of speech, word length, and post-word punctuation type; the acoustic feature set of each word is extracted according to the audio data, where the acoustic feature set includes the final-syllable duration, post-word pause duration, final-syllable acoustic statistical features, and inter-word acoustic feature change values. Finally, the prosody hierarchy structure is obtained through the prosody hierarchy annotation model according to the term identifier, the to-be-annotated text feature set, and the acoustic feature set of each word. In this way, the prosody hierarchy annotation model is established by combining text features and acoustic features, which can provide richer features for prosody hierarchy annotation, and using a more accurate prosody hierarchy annotation model can improve the accuracy of prosody hierarchy annotation.
Optionally, on the basis of the embodiment corresponding to Figure 11 above, in another embodiment of the prosody hierarchy annotation apparatus 30 provided by the embodiment of the present application:
the prediction module 303 is specifically configured to determine at least one of prosodic word, prosodic phrase, and intonation phrase through the prosody hierarchy annotation model;
or,
to determine prosodic words and/or prosodic phrases through the prosody hierarchy annotation model.
Secondly, the embodiment of the present application describes two common prosody hierarchy annotation methods: one determines prosodic words, prosodic phrases, and intonation phrases through the prosody hierarchy annotation model, and the other determines prosodic words and prosodic phrases through the prosody hierarchy annotation model. In this way, the user can choose the finer three-layer annotation scheme of prosodic word, prosodic phrase, and intonation phrase, or the two-layer scheme of prosodic word and prosodic phrase. This scheme can therefore select the output prosody hierarchy on demand, improving the flexibility of the scheme.
The model training apparatus in the present application is described in detail below. Please refer to Figure 12, which is a schematic diagram of one embodiment of the model training apparatus in the embodiments of the present application. The model training apparatus 40 includes:
an obtaining module 401, configured to obtain text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
an extraction module 402, configured to extract the text feature set to be trained of each word according to the text data to be trained obtained by the obtaining module 401, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type;
the extraction module 402 is further configured to extract the acoustic feature set to be trained of each word according to the audio data to be trained obtained by the obtaining module 401, where the acoustic feature set to be trained includes final syllable duration, post-word pause duration, final syllable acoustic statistical features, and inter-word acoustic feature change values;
a training module 403, configured to train on the word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module 402, and the acoustic feature set to be trained of each word, to obtain a prosody hierarchy marking model, where the prosody hierarchy marking model is used to mark a prosody hierarchy structure.
In this embodiment, the obtaining module 401 obtains the text data to be trained and the audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier. The extraction module 402 extracts the text feature set to be trained of each word from the text data to be trained obtained by the obtaining module 401, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type, and extracts the acoustic feature set to be trained of each word from the audio data to be trained obtained by the obtaining module 401, where the acoustic feature set to be trained includes final syllable duration, post-word pause duration, final syllable acoustic statistical features, and inter-word acoustic feature change values. The training module 403 trains on the word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module 402, and the acoustic feature set to be trained of each word, to obtain the prosody hierarchy marking model, where the prosody hierarchy marking model is used to mark a prosody hierarchy structure.
In the embodiments of the present application, a method of model training is provided. First, text data to be trained and audio data to be trained are obtained, where there is a correspondence between the text data to be trained and the audio data to be trained, and each word corresponds to a word identifier. Then the text feature set to be trained of each word is extracted from the text data to be trained, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type, and the acoustic feature set to be trained of each word is extracted from the audio data to be trained, where the acoustic feature set to be trained includes final syllable duration, post-word pause duration, final syllable acoustic statistical features, and inter-word acoustic feature change values. Finally, training is performed on the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word, to obtain the prosody hierarchy marking model. In this manner, the prosody hierarchy marking model is built jointly on text features and acoustic features, which provides richer features for the prosody hierarchy marking task; using a more accurate prosody hierarchy marking model improves the accuracy of prosody hierarchy marking and the quality of speech synthesis.
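A minimal sketch of how the per-word training samples described above might be assembled, assuming the feature extraction has already produced per-word dictionaries; all names are illustrative:

```python
def build_samples(words, word_ids, text_feats, acoustic_feats):
    """Pair each word's identifier with its text feature set (POS, word
    length, post-word punctuation) and its acoustic feature set, yielding
    the per-word samples on which the prosody model is trained."""
    return [
        {
            "word_id": word_ids[w],
            "text": text_feats[w],
            "acoustic": acoustic_feats[w],
        }
        for w in words
    ]
```

Each sample thus carries all three inputs the training module consumes: the word identifier, the text features, and the acoustic features.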
Optionally, on the basis of the embodiment corresponding to Figure 12 above, please refer to Figure 13. In another embodiment of the model training apparatus 40 provided by the embodiments of the present application, the model training apparatus 40 further includes a processing module 404 and a generation module 405;
the processing module 404 is configured to, after the obtaining module 401 obtains the text data to be trained and the audio data to be trained, perform word segmentation on the text data to be trained to obtain at least one word;
the obtaining module 401 is further configured to obtain a target word identifier corresponding to a target word according to a preset word identifier relationship, where the preset word identifier relationship indicates the relationship between each preset word and its word identifier, and the target word is any one of the at least one word obtained by the processing module;
the generation module 405 is configured to generate a target word vector corresponding to the target word in the text data to be trained;
the training module 403 is specifically configured to train on the target word identifier obtained by the obtaining module 401 and the target word vector generated by the generation module 405, to obtain a first model parameter, where the first model parameter is used to generate the word embedding layer in the prosody hierarchy marking model.
Secondly, the embodiments of the present application describe a method of training the word embedding layer. First, word segmentation is performed on the text data to be trained; then the target word identifier corresponding to the target word is obtained according to the preset word identifier relationship, the target word vector corresponding to the target word in the text data to be trained is generated, and the target word identifier and the target word vector are trained to obtain the first model parameter, where the first model parameter is used to generate the word embedding layer in the prosody hierarchy marking model. In this manner, the word embedding layer in the prosody hierarchy marking model can be trained directly, and the other neural networks in the prosody hierarchy marking model can be trained at the same time as the word embedding layer, which avoids the extra step of training a separate word vector model with an independent neural network and improves training efficiency.
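The point of training the embedding jointly is that its parameters are updated by gradients flowing back from the downstream prosody model, rather than by a separate word2vec-style pass. A toy sketch of such a lookup table (dimensions, initialization, and update rule are all illustrative assumptions):

```python
import random

class WordEmbedding:
    """Toy word-embedding layer trained jointly with the rest of the model.
    A sketch only: real systems would use a framework's embedding module."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # one row of `dim` weights per word identifier
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def lookup(self, word_id):
        return self.table[word_id]

    def update(self, word_id, grad, lr=0.01):
        # the gradient arrives from the downstream prosody network,
        # so no separate word-vector pre-training step is needed
        row = self.table[word_id]
        for i, g in enumerate(grad):
            row[i] -= lr * g
```

Only the row of the word actually seen in a sample is touched by each update, which is what makes joint training of the embedding cheap.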
Optionally, on the basis of the embodiment corresponding to Figure 12 above, in another embodiment of the model training apparatus 40 provided by the embodiments of the present application,
the extraction module 402 is specifically configured to obtain the part of speech, word length, and post-word punctuation type of the target word in the text data to be trained, where the part of speech indicates the syntactic category of the word, the word length indicates the number of characters in the word, and the post-word punctuation type indicates the type of punctuation following the word;
and to obtain the part of speech, word length, and post-word punctuation type of related words in the text data to be trained, where a related word is a word that has an association relationship with the target word;
the training module 403 is specifically configured to train on the part of speech, word length, and post-word punctuation type of the target word, together with the part of speech, word length, and post-word punctuation type of the related words, to obtain a second model parameter, where the second model parameter is used to generate the text neural network in the prosody hierarchy marking model.
Secondly, the embodiments of the present application describe a method of training the text neural network. First, the part of speech, word length, and post-word punctuation type of the target word in the text data to be trained are obtained, together with the part of speech, word length, and post-word punctuation type of the related words. Then the features of the target word and of the related words are trained to obtain the second model parameter, where the second model parameter is used to generate the text neural network in the prosody hierarchy marking model. In this manner, the neural network allows the system to automatically learn, from the raw input text feature set, high-level feature representations that benefit prosody hierarchy structure marking, which improves the accuracy of the prosody hierarchy marking model.
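Before a network can consume them, the three text features of a word must be turned into a numeric vector. A minimal sketch of one plausible encoding (the tag and punctuation inventories below are illustrative assumptions, not from the source):

```python
POS_TAGS = ["n", "v", "a", "d", "p", "u"]            # illustrative POS tag set
PUNCT_TYPES = ["none", "comma", "period", "other"]   # post-word punctuation classes

def encode_text_features(pos, word_len, punct):
    """One-hot encode a word's part of speech and post-word punctuation
    type, then append its word length, producing the text feature vector
    fed to the text neural network."""
    vec = [1.0 if t == pos else 0.0 for t in POS_TAGS]
    vec += [1.0 if p == punct else 0.0 for p in PUNCT_TYPES]
    vec.append(float(word_len))
    return vec
```

The same encoding would be applied to the related words, and the resulting vectors concatenated as the network's input.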
Optionally, on the basis of the embodiment corresponding to Figure 12 above, please refer to Figure 14. In another embodiment of the model training apparatus 40 provided by the embodiments of the present application, the model training apparatus 40 further includes an alignment module 406;
the alignment module 406 is configured to, after the obtaining module 401 obtains the text data to be trained and the audio data to be trained, perform forced alignment between the text data to be trained and the audio data to be trained to obtain a time-aligned text;
the extraction module 402 is specifically configured to determine the final syllable duration of the target word according to the time-aligned text.
Secondly, in the embodiments of the present application, after the text data to be trained and the audio data to be trained are obtained, forced alignment is performed between the text data to be trained and the audio data to be trained to obtain a time-aligned text, and the final syllable duration of the target word is then determined from the time-aligned text. In this manner, the time-aligned text can be obtained and the final syllable duration extracted as one element of the acoustic feature set; high-level features that benefit the marking are learned automatically from the raw input acoustic feature set, which improves the accuracy of the prosody hierarchy marking model.
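Given forced-alignment output as (start, end) time intervals, the final syllable duration of a word can be read off directly. A sketch under the assumption that alignment yields word-level and syllable-level intervals in seconds (the interval format is an assumption; real aligners emit e.g. TextGrid tiers):

```python
def final_syllable_duration(word_intervals, syllable_intervals, word):
    """Return the duration of the last syllable aligned inside `word`.
    word_intervals: {word: (start, end)}; syllable_intervals: [(start, end)]."""
    w_start, w_end = word_intervals[word]
    last = None
    for s_start, s_end in syllable_intervals:
        if s_start >= w_start and s_end <= w_end:
            last = (s_start, s_end)  # keep the latest syllable inside the word
    if last is None:
        raise ValueError("no syllable aligned inside word")
    return last[1] - last[0]
```

Because the syllable intervals are time-ordered, the last one falling inside the word's span is its final syllable.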
Optionally, on the basis of the embodiment corresponding to Figure 14 above, in another embodiment of the model training apparatus 40 provided by the embodiments of the present application,
the extraction module 402 is specifically configured to determine the post-word pause duration of the target word according to the time-aligned text.
Thirdly, in the embodiments of the present application, after the text data to be trained and the audio data to be trained are obtained, forced alignment is performed between them to obtain a time-aligned text, and the post-word pause duration can then be determined according to the time-aligned text. In this manner, after the text data and the audio data are force-aligned, the post-word pause duration of each word can be determined and used as one element of the acoustic feature set; high-level features that benefit the marking are learned automatically from the raw input acoustic feature set, which improves the accuracy of the prosody hierarchy marking model.
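The post-word pause is simply the gap between a word's aligned end time and the next word's aligned start time. A sketch, assuming consecutive time-aligned (start, end) word intervals in seconds:

```python
def pauses_after_words(intervals):
    """Given time-aligned (start, end) intervals for consecutive words,
    return the pause duration after each word (0.0 after the last word
    or when the next word starts immediately)."""
    pauses = []
    for i, (_, end) in enumerate(intervals):
        nxt = intervals[i + 1][0] if i + 1 < len(intervals) else end
        pauses.append(max(0.0, nxt - end))
    return pauses
```

Long pauses tend to coincide with higher-level prosodic boundaries, which is why this value is informative for the marking model.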
Optionally, on the basis of the embodiment corresponding to Figure 14 above, in another embodiment of the model training apparatus 40 provided by the embodiments of the present application,
the extraction module 402 is specifically configured to calculate, according to the time-aligned text and the fundamental frequency information extracted from the audio data to be trained, the frame number of the voiced start frame and the frame number of the voiced end frame of the final syllable of the target word;
to extract the logarithmic fundamental frequency curve and the logarithmic energy curve of the audio data to be trained;
and to calculate the final syllable acoustic statistical features of the target word according to the frame number of the voiced start frame of the final syllable of the target word, the frame number of the voiced end frame, the logarithmic fundamental frequency curve, and the logarithmic energy curve, where the final syllable acoustic statistical features include at least one of the maximum, minimum, range, mean, and variance of the logarithmic fundamental frequency curve, and further include at least one of the maximum, minimum, range, mean, and variance of the logarithmic energy curve.
Thirdly, in the embodiments of the present application, after the text data to be trained and the audio data to be trained are obtained, forced alignment is performed between them to obtain a time-aligned text. Then, according to the time-aligned text and the fundamental frequency information extracted from the audio data to be trained, the frame numbers of the voiced start frame and the voiced end frame of the final syllable of the target word are calculated, and the logarithmic fundamental frequency curve and the logarithmic energy curve of the audio data to be trained are extracted. Finally, the final syllable acoustic statistical features of the target word are calculated according to the frame number of the voiced start frame, the frame number of the voiced end frame, the logarithmic fundamental frequency curve, and the logarithmic energy curve. In this manner, the time-aligned text data is obtained, and the frame numbers of the start and end of the word-final voiced segment can be obtained from the fundamental frequency information extracted from the audio; high-level features that benefit the marking are learned automatically from the raw input acoustic feature set, which improves the accuracy of the prosody hierarchy marking model.
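Once the voiced start and end frame numbers of the final syllable are known, the statistical features reduce to summary statistics over a slice of the log-F0 (or log-energy) curve. A minimal sketch (the curve is assumed to be a per-frame list of log values):

```python
def voiced_segment_stats(curve, start_frame, end_frame):
    """Max, min, range, mean, and variance of a log-F0 or log-energy
    curve over the voiced frames [start_frame, end_frame] of a word's
    final syllable."""
    seg = curve[start_frame:end_frame + 1]
    mean = sum(seg) / len(seg)
    var = sum((x - mean) ** 2 for x in seg) / len(seg)
    return {
        "max": max(seg),
        "min": min(seg),
        "range": max(seg) - min(seg),
        "mean": mean,
        "var": var,
    }
```

Applying this once to the log-F0 curve and once to the log-energy curve yields the final syllable acoustic statistical features described above.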
Optionally, on the basis of the embodiment corresponding to Figure 14 above, in another embodiment of the model training apparatus 40 provided by the embodiments of the present application,
the extraction module 402 is specifically configured to calculate, according to the time-aligned text and the fundamental frequency information extracted from the audio data to be trained, the frame number of the last voiced frame of the target word and the frame number of the word-initial voiced frame of the next adjacent word of the target word;
to determine, according to the frame number of the last voiced frame of the target word, the frame number of the word-initial voiced frame of the next adjacent word, and the frame-level fundamental frequency information and energy information extracted from the audio data to be trained, the fundamental frequency values and energy values of the word-final voiced frame of the target word and of the word-initial voiced frame of the next adjacent word;
and to calculate the logarithmic difference of the fundamental frequency values according to the fundamental frequency values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word, and the logarithmic difference of the energy values according to the energy values of the same two frames, where the logarithmic difference of the fundamental frequency values and the logarithmic difference of the energy values belong to the inter-word acoustic feature change values.
Thirdly, in the embodiments of the present application, after the text data to be trained and the audio data to be trained are obtained, forced alignment is performed between them to obtain a time-aligned text. Then, according to the frame number of the last voiced frame of the target word and the frame number of the voiced frame of the next adjacent word, together with the frame-level fundamental frequency and energy data extracted from the audio, the fundamental frequency value and energy value of the word-final voiced frame of the target word and of the word-initial voiced frame of the next word are determined, and the logarithmic difference of the two fundamental frequency values and the logarithmic difference of the two energy values are calculated as the inter-word acoustic feature change values. In this manner, high-level features that benefit the marking are learned automatically from the raw input acoustic feature set, which improves the accuracy of the prosody hierarchy marking model.
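The inter-word change values are just log differences between the two boundary frames. A sketch, assuming linear-scale F0 (Hz) and energy values for the target word's last voiced frame and the next word's first voiced frame:

```python
import math

def interword_change(f0_tail, energy_tail, f0_head, energy_head):
    """Inter-word acoustic feature change values: log differences of F0
    and energy between the last voiced frame of one word (tail) and the
    first voiced frame of the next word (head)."""
    d_f0 = math.log(f0_head) - math.log(f0_tail)
    d_energy = math.log(energy_head) - math.log(energy_tail)
    return d_f0, d_energy
```

A large drop in log F0 or log energy across a word boundary is a typical cue for a prosodic phrase break, which is what these two values let the model exploit.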
Optionally, on the basis of the embodiments corresponding to Figure 12, Figure 13, or Figure 14 above, in another embodiment of the model training apparatus 40 provided by the embodiments of the present application,
the training module 403 is specifically configured to obtain a first output result of the target word identifier through the word embedding layer in the prosody hierarchy marking model, where the target word identifier corresponds to the target word, the target word is any one of the at least one word, and the word embedding layer is obtained by training according to the first model parameter;
to obtain a second output result of the target text feature set to be trained through the text neural network in the prosody hierarchy marking model, where the target text feature set to be trained corresponds to the target word, and the text neural network is obtained by training according to the second model parameter;
to train on the first output result, the second output result, and the target acoustic feature set to be trained, to obtain a third model parameter, where the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used to generate the acoustic neural network in the prosody hierarchy marking model;
and to generate the prosody hierarchy marking model according to the first model parameter, the second model parameter, and the third model parameter.
Further, the embodiments of the present application describe a method of training to obtain the prosody hierarchy marking model: three classes of model parameters need to be trained, namely the first model parameter, the second model parameter, and the third model parameter, and the three are trained together as a whole to generate the prosody hierarchy marking model. In this manner, the three parts are stacked as neural networks to form one complete prosody hierarchy marking model and trained as a single whole; the training covers word identifiers and word vectors, sentence- and word-level text features, and audio and acoustic features, so richer features can be obtained, which helps improve the accuracy of the marking.
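The stacking described above can be sketched as a forward pass in which the embedding's output (first output result) and the text network's output (second output result) are concatenated with the acoustic features and fed to the acoustic network; all three callables below are placeholders for the trained sub-networks, not real implementations.

```python
def forward(word_id, text_feats, acoustic_feats,
            embed, text_net, acoustic_net):
    """Forward pass of the stacked prosody model (a sketch).
    embed/text_net/acoustic_net stand in for the sub-networks whose
    parameters are the first, second, and third model parameters."""
    e = embed(word_id)        # first output result (list of floats)
    t = text_net(text_feats)  # second output result (list of floats)
    # list concatenation plays the role of feature concatenation here
    return acoustic_net(e + t + acoustic_feats)
```

Because the three parts sit in one computation graph, one training pass updates all three parameter sets jointly, as the embodiment describes.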
The embodiments of the present application further provide another prosody hierarchy marking apparatus. As shown in Figure 15, for ease of description only the parts relevant to the embodiments of the present application are shown; for specific technical details not disclosed, please refer to the method parts of the embodiments of the present application. The terminal device may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, an in-vehicle computer, and the like. The following takes a mobile phone as the terminal device:
Figure 15 shows a block diagram of a partial structure of a mobile phone related to the terminal device provided by the embodiments of the present application. Referring to Figure 15, the mobile phone includes components such as a radio frequency (RF) circuit 510, a memory 520, an input unit 530, a display unit 540, a sensor 550, an audio circuit 560, a wireless fidelity (WiFi) module 570, a processor 580, and a power supply 590. Those skilled in the art will understand that the mobile phone structure shown in Figure 15 does not limit the mobile phone, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
Each component of the mobile phone is described in detail below with reference to Figure 15:
The RF circuit 510 can be used to receive and send signals during information transmission and reception or during a call; in particular, after receiving downlink information from a base station, it passes the information to the processor 580 for processing, and it sends uplink data to the base station. In general, the RF circuit 510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 510 can also communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to the global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), e-mail, short message service (SMS), and the like.
The memory 520 can be used to store software programs and modules, and the processor 580 runs the software programs and modules stored in the memory 520 to perform the various functional applications and data processing of the mobile phone. The memory 520 may mainly include a program storage area and a data storage area, where the program storage area can store an operating system, an application required by at least one function (such as a sound playback function, an image playback function, and the like), and so on; the data storage area can store data created according to the use of the mobile phone (such as audio data, a phone book, and the like). In addition, the memory 520 may include a high-speed random access memory, and may also include a nonvolatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 530 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 530 may include a touch panel 531 and other input devices 532. The touch panel 531, also called a touch screen, can collect the user's touch operations on or near it (such as operations performed on or near the touch panel 531 with a finger, a stylus, or any other suitable object or accessory) and drive a corresponding connection apparatus according to a preset program. Optionally, the touch panel 531 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the user's touch position, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection apparatus, converts it into touch point coordinates, sends the coordinates to the processor 580, and can receive and execute commands sent by the processor 580. Furthermore, the touch panel 531 can be implemented in multiple types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 531, the input unit 530 may also include other input devices 532. Specifically, the other input devices 532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, a joystick, and the like.
The display unit 540 can be used to display information input by the user or information provided to the user, as well as the various menus of the mobile phone. The display unit 540 may include a display panel 541, which optionally may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 531 can cover the display panel 541; when the touch panel 531 detects a touch operation on or near it, the operation is passed to the processor 580 to determine the type of the touch event, and the processor 580 then provides a corresponding visual output on the display panel 541 according to the type of the touch event. Although in Figure 15 the touch panel 531 and the display panel 541 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 531 and the display panel 541 can be integrated to implement the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 550, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of the display panel 541 according to the ambient light, and the proximity sensor can turn off the display panel 541 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in all directions (generally along three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the phone's posture (such as landscape/portrait switching, related games, and magnetometer pose calibration) and for vibration-recognition related functions (such as a pedometer or tap detection). The mobile phone can also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described in detail here.
The audio circuit 560, a speaker 561, and a microphone 562 can provide an audio interface between the user and the mobile phone. The audio circuit 560 can transmit the electrical signal converted from received audio data to the speaker 561, which converts it into a sound signal for output; on the other hand, the microphone 562 converts a collected sound signal into an electrical signal, which the audio circuit 560 receives and converts into audio data. After the audio data is processed by the processor 580, it is sent through the RF circuit 510 to, for example, another mobile phone, or output to the memory 520 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 570, the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media, and the like; it provides the user with wireless broadband Internet access. Although Figure 15 shows the WiFi module 570, it can be understood that the module is not an essential component of the mobile phone and can be omitted as needed without changing the essence of the invention.
The processor 580 is the control center of the mobile phone, connecting all parts of the whole phone through various interfaces and lines. By running or executing the software programs and/or modules stored in the memory 520 and calling the data stored in the memory 520, it performs the various functions of the mobile phone and processes data, thereby monitoring the phone as a whole. Optionally, the processor 580 may include one or more processing units; optionally, the processor 580 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, applications, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 580.
The mobile phone further includes a power supply 590 (such as a battery) that supplies power to all the components. Optionally, the power supply can be logically connected to the processor 580 through a power management system, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management system.
Although not shown, the mobile phone may also include a camera, a Bluetooth module, and the like, which are not described in detail here.
In the embodiments of the present application, the processor 580 included in the terminal device also has the following functions:
obtaining text data to be marked and audio data, where there is a correspondence between the text data to be marked and the audio data, the text data to be marked includes at least one word, and each word corresponds to a word identifier;
extracting the text feature set to be marked of each word according to the text data to be marked, where the text feature set to be marked includes part of speech, word length, and post-word punctuation type;
extracting the acoustic feature set of each word according to the audio data, where the acoustic feature set includes final syllable duration, post-word pause duration, final syllable acoustic statistical features, and inter-word acoustic feature change values;
obtaining the prosody hierarchy structure through the prosody hierarchy marking model according to the word identifier of each word, the text feature set to be marked of each word, and the acoustic feature set of each word.
In the embodiments of the present application, the processor 580 included in the terminal device also has the following functions:
obtaining text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
extracting the text feature set to be trained of each word according to the text data to be trained, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type;
extracting the acoustic feature set to be trained of each word according to the audio data to be trained, where the acoustic feature set to be trained includes final syllable duration, post-word pause duration, final syllable acoustic statistical features, and inter-word acoustic feature change values;
training on the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word, to obtain the prosody hierarchy marking model, where the prosody hierarchy marking model is used to mark a prosody hierarchy structure.
Figure 16 is a schematic diagram of a server structure provided by the embodiments of the present application. The server 600 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 622 (for example, one or more processors), a memory 632, and one or more storage media 630 (such as one or more mass storage devices) that store application programs 642 or data 644. The memory 632 and the storage medium 630 may provide transient storage or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Further, the central processing unit 622 can be configured to communicate with the storage medium 630 and to execute, on the server 600, the series of instruction operations in the storage medium 630.
The server 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in Figure 16.
In the embodiment of the present application, the CPU 622 included in the server further has the following functions:
obtaining to-be-labeled text data and audio data, wherein there is a correspondence between the to-be-labeled text data and the audio data, the to-be-labeled text data includes at least one word, and each word corresponds to one word identifier;
extracting a to-be-labeled text feature set of each word according to the to-be-labeled text data, wherein the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
according to the word identifier of each word, the to-be-labeled text feature set of each word, and the acoustic feature set of each word, obtaining a prosody hierarchy structure through the prosody hierarchy labeling model.
In the embodiment of the present application, the CPU 622 included in the server further has the following functions:
obtaining to-be-trained text data and to-be-trained audio data, wherein there is a correspondence between the to-be-trained text data and the to-be-trained audio data, the to-be-trained text data includes at least one word, and each word corresponds to one word identifier;
extracting a to-be-trained text feature set of each word according to the to-be-trained text data, wherein the to-be-trained text feature set includes a part of speech, a word length, and a post-word punctuation type;
extracting a to-be-trained acoustic feature set of each word according to the to-be-trained audio data, wherein the to-be-trained acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
training the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure.
With the research and progress of artificial intelligence technology, artificial intelligence has been researched and applied in many fields, such as smart homes, intelligent wearable devices, virtual assistants, smart speakers, intelligent marketing, unmanned driving, autonomous driving, drones, robots, intelligent healthcare, and intelligent customer service. It is believed that, with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
It is apparent to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of units is merely a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disc.
The above embodiments are merely intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements to some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.
Claims (15)
1. A method of prosody hierarchy labeling, comprising:
obtaining to-be-labeled text data and audio data, wherein there is a correspondence between the to-be-labeled text data and the audio data, the to-be-labeled text data includes at least one word, and each word corresponds to one word identifier;
extracting a to-be-labeled text feature set of each word according to the to-be-labeled text data, wherein the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type, and the audio data is voice data;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
according to the word identifier of each word, the to-be-labeled text feature set of each word, and the acoustic feature set of each word, obtaining a prosody hierarchy structure through a prosody hierarchy labeling model, wherein the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase.
2. A method of model training, comprising:
obtaining to-be-trained text data and to-be-trained audio data, wherein there is a correspondence between the to-be-trained text data and the to-be-trained audio data, the to-be-trained text data includes at least one word, each word corresponds to one word identifier, and the to-be-trained audio data is voice data;
extracting a to-be-trained text feature set of each word according to the to-be-trained text data, wherein the to-be-trained text feature set includes a part of speech, a word length, and a post-word punctuation type;
extracting a to-be-trained acoustic feature set of each word according to the to-be-trained audio data, wherein the to-be-trained acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
training the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure, and the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase.
3. The method according to claim 2, wherein after the obtaining of the to-be-trained text data and the to-be-trained audio data, the method further comprises:
performing word segmentation on the to-be-trained text data to obtain at least one word;
obtaining a target word identifier corresponding to a target word according to a preset word identifier relationship, wherein the preset word identifier relationship is used to indicate a relationship between each preset word and a word identifier, and the target word is any one of the at least one word;
generating a target word vector corresponding to the target word in the to-be-trained text data;
wherein the training of the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain the prosody hierarchy labeling model comprises:
training the target word identifier and the target word vector to obtain a first model parameter, wherein the first model parameter is used to generate a word embedding layer in the prosody hierarchy labeling model, and the word embedding layer is updated within a target time.
4. The method according to claim 2, wherein the extracting of the to-be-trained text feature set of each word according to the to-be-trained text data comprises:
obtaining a part of speech, a word length, and a post-word punctuation type of a target word in the to-be-trained text data, wherein the part of speech indicates a syntactic classification result of the word, the word length indicates the number of characters in the word, and the post-word punctuation type indicates the type of punctuation following the word;
obtaining a part of speech, a word length, and a post-word punctuation type of an associated word in the to-be-trained text data, wherein the associated word is a word having an association relationship with the target word;
wherein the training of the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain the prosody hierarchy labeling model comprises:
training, using a loss function, the part of speech, word length, and post-word punctuation type of the target word and the part of speech, word length, and post-word punctuation type of the associated word;
when the loss function reaches a minimum value, obtaining a second model parameter, wherein the second model parameter is used to generate a text neural network in the prosody hierarchy labeling model.
5. The method according to claim 2, wherein after the obtaining of the to-be-trained text data and the to-be-trained audio data, the method further comprises:
performing forced alignment on the to-be-trained text data and the to-be-trained audio data to obtain a time-aligned text;
wherein the extracting of the to-be-trained acoustic feature set of each word according to the to-be-trained audio data comprises:
determining the final syllable duration of the target word according to the time-aligned text.
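Assuming the forced alignment of claim 5 yields per-syllable time intervals for each word (the tuple format below is an assumption for illustration), the final syllable duration could be read off as follows:

```python
def final_syllable_duration(aligned_word):
    """aligned_word: list of (syllable, start_sec, end_sec) tuples for one word,
    in time order, as produced by forced alignment of text and audio.
    Returns the duration of the word's last syllable in seconds."""
    syllable, start, end = aligned_word[-1]  # the final syllable of the word
    return end - start

# Example: a two-syllable word aligned to 0.00-0.43 s of the audio
word = [("ni3", 0.00, 0.18), ("hao3", 0.18, 0.43)]
print(round(final_syllable_duration(word), 2))  # 0.25
```

The post-word pause duration of claim 6 could be obtained analogously, as the gap between the end time of one word and the start time of the next.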
6. The method according to claim 5, wherein the extracting of the to-be-trained acoustic feature set of each word according to the to-be-trained audio data comprises:
determining the post-word pause duration of the target word according to the time-aligned text.
7. The method according to claim 5, wherein the extracting of the to-be-trained acoustic feature set of each word according to the to-be-trained audio data comprises:
calculating, according to the time-aligned text and fundamental frequency information extracted from the to-be-trained audio data, the frame number of the voiced start frame and the frame number of the voiced end frame of the final syllable of the target word;
extracting a logarithmic fundamental frequency curve and a logarithmic energy curve of the to-be-trained audio data;
calculating the final syllable acoustic statistical features of the target word according to the frame number of the voiced start frame, the frame number of the voiced end frame, the logarithmic fundamental frequency curve, and the logarithmic energy curve, wherein the final syllable acoustic statistical features include at least one of the maximum value, minimum value, range, mean, and variance of the logarithmic fundamental frequency curve, and further include at least one of the maximum value, minimum value, range, mean, and variance of the logarithmic energy curve.
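The statistics named in claim 7 are standard descriptive statistics over the voiced frames of the final syllable. A minimal sketch, with a toy log-F0 curve standing in for the curves extracted from real audio:

```python
def syllable_stats(curve, start_frame, end_frame):
    """Max, min, range, mean, and variance of a log-domain curve
    (log-F0 or log-energy) over the voiced frames [start_frame, end_frame]
    of a final syllable."""
    seg = curve[start_frame:end_frame + 1]
    mean = sum(seg) / len(seg)
    var = sum((v - mean) ** 2 for v in seg) / len(seg)
    return {"max": max(seg), "min": min(seg),
            "range": max(seg) - min(seg), "mean": mean, "var": var}

# Toy log-F0 curve (one value per frame, 0.0 = unvoiced);
# frames 2..5 are the final syllable's voiced span
log_f0 = [0.0, 0.0, 5.0, 5.2, 5.1, 4.9, 0.0]
stats = syllable_stats(log_f0, 2, 5)
```

The same function applied to the logarithmic energy curve yields the second half of the final syllable acoustic statistical features.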
8. The method according to claim 5, wherein the extracting of the to-be-trained acoustic feature set of each word according to the to-be-trained audio data comprises:
calculating, according to the time-aligned text and fundamental frequency information extracted from the to-be-trained audio data, the frame number of the last voiced frame of the target word and the frame number of the first voiced frame of the next adjacent word following the target word;
determining, according to the frame number of the last voiced frame of the target word, the frame number of the first voiced frame of the next adjacent word, and fundamental frequency information and energy information extracted frame by frame from the to-be-trained audio data, the fundamental frequency values and energy values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word;
calculating a logarithmic difference of the fundamental frequency values according to the fundamental frequency values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word, and calculating a logarithmic difference of the energy values according to the energy values of those two frames, wherein the logarithmic difference of the fundamental frequency values and the logarithmic difference of the energy values belong to the inter-word acoustic feature change value.
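The inter-word change values of claim 8 reduce to log differences between two frames. A sketch under the assumption that per-frame F0 and energy tracks are available and that unvoiced frames are marked with 0.0 (the toy values are for illustration only):

```python
import math

def interword_change(f0, energy, last_voiced, next_first_voiced):
    """Log differences of F0 and energy between the word-final voiced frame
    of one word and the word-initial voiced frame of the next adjacent word.
    f0, energy: per-frame linear-domain values; the two frame indices come
    from forced alignment plus voicing detection."""
    df0 = math.log(f0[next_first_voiced]) - math.log(f0[last_voiced])
    de = math.log(energy[next_first_voiced]) - math.log(energy[last_voiced])
    return df0, de

# Toy frame tracks: word A's last voiced frame is 3, word B's first voiced frame is 6
f0 = [200.0, 210.0, 205.0, 200.0, 0.0, 0.0, 100.0, 110.0]
energy = [1.0, 1.1, 1.0, 0.8, 0.0, 0.0, 0.4, 0.5]
df0, de = interword_change(f0, energy, 3, 6)
print(round(df0, 3))  # log(100/200) ≈ -0.693
```

A large negative log-F0 or log-energy drop across the word boundary is the kind of change value the model can associate with a prosodic boundary.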
9. The method according to any one of claims 2 to 8, wherein the training of the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain the prosody hierarchy labeling model comprises:
obtaining a first output result of a target word identifier through the word embedding layer in the prosody hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word is any one of the at least one word, and the word embedding layer is obtained by training according to a first model parameter;
obtaining a second output result of a target to-be-trained text feature set through the text neural network in the prosody hierarchy labeling model, wherein the target to-be-trained text feature set corresponds to the target word, and the text neural network is obtained by training according to a second model parameter;
training the first output result, the second output result, and a target to-be-trained acoustic feature set to obtain a third model parameter, wherein the target to-be-trained acoustic feature set corresponds to the target word, and the third model parameter is used to generate an acoustic neural network in the prosody hierarchy labeling model;
generating the prosody hierarchy labeling model according to the first model parameter, the second model parameter, and the third model parameter.
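Claim 9 fuses three streams per word: the word embedding output, the text network output, and the acoustic feature set. The sketch below illustrates only the fusion and labeling step, with randomly initialized values and a single linear layer standing in for the acoustic neural network; the dimensions and label set are assumptions, not the patent's architecture:

```python
import random

random.seed(0)

# Assumed toy dimensions and label set for illustration
EMB_DIM, TEXT_DIM, ACOUSTIC_DIM = 8, 4, 5
LABELS = ["none", "prosodic_word", "prosodic_phrase", "intonation_phrase"]

def label_word(first_out, second_out, acoustic_feats, weights, bias):
    """Concatenate the word-embedding output (first output result), the text
    network output (second output result), and the acoustic feature set, then
    score the prosody labels with a linear layer standing in for the acoustic
    neural network."""
    x = first_out + second_out + acoustic_feats  # feature concatenation
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return LABELS[scores.index(max(scores))]

dim = EMB_DIM + TEXT_DIM + ACOUSTIC_DIM
first_out = [random.gauss(0, 1) for _ in range(EMB_DIM)]    # from the word embedding layer
second_out = [random.gauss(0, 1) for _ in range(TEXT_DIM)]  # from the text neural network
acoustic = [random.gauss(0, 1) for _ in range(ACOUSTIC_DIM)]
W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(len(LABELS))]
b = [0.0] * len(LABELS)
print(label_word(first_out, second_out, acoustic, W, b))
```

In training, the first and second model parameters (embedding layer and text network) are learned first, and the weights of this fusion step would be the third model parameter.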
10. A prosody hierarchy labeling device, comprising:
an obtaining module, configured to obtain to-be-labeled text data and audio data, wherein there is a correspondence between the to-be-labeled text data and the audio data, the to-be-labeled text data includes at least one word, each word corresponds to one word identifier, and the audio data is voice data;
an extraction module, configured to extract a to-be-labeled text feature set of each word according to the to-be-labeled text data obtained by the obtaining module, wherein the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type;
the extraction module being further configured to extract an acoustic feature set of each word according to the audio data obtained by the obtaining module, wherein the acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
a prediction module, configured to obtain a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the to-be-labeled text feature set of each word extracted by the extraction module, and the acoustic feature set of each word, wherein the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase.
11. A model training device, comprising:
an obtaining module, configured to obtain to-be-trained text data and to-be-trained audio data, wherein there is a correspondence between the to-be-trained text data and the to-be-trained audio data, the to-be-trained text data includes at least one word, each word corresponds to one word identifier, and the to-be-trained audio data is voice data;
an extraction module, configured to extract a to-be-trained text feature set of each word according to the to-be-trained text data obtained by the obtaining module, wherein the to-be-trained text feature set includes a part of speech, a word length, and a post-word punctuation type;
the extraction module being further configured to extract a to-be-trained acoustic feature set of each word according to the to-be-trained audio data obtained by the obtaining module, wherein the to-be-trained acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
a training module, configured to train the word identifier corresponding to each word, the to-be-trained text feature set of each word extracted by the extraction module, and the to-be-trained acoustic feature set of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure, and the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase.
12. A terminal device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory, including the following steps:
obtaining to-be-labeled text data and audio data, wherein there is a correspondence between the to-be-labeled text data and the audio data, the to-be-labeled text data includes at least one word, and each word corresponds to one word identifier;
extracting a to-be-labeled text feature set of each word according to the to-be-labeled text data, wherein the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type, and the audio data is voice data;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
according to the word identifier of each word, the to-be-labeled text feature set of each word, and the acoustic feature set of each word, obtaining a prosody hierarchy structure through a prosody hierarchy labeling model, wherein the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase;
and the bus system is configured to connect the memory and the processor so that the memory and the processor communicate with each other.
13. A server, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory, including the following steps:
obtaining to-be-trained text data and to-be-trained audio data, wherein there is a correspondence between the to-be-trained text data and the to-be-trained audio data, the to-be-trained text data includes at least one word, each word corresponds to one word identifier, and the to-be-trained audio data is voice data;
extracting a to-be-trained text feature set of each word according to the to-be-trained text data, wherein the to-be-trained text feature set includes a part of speech, a word length, and a post-word punctuation type;
extracting a to-be-trained acoustic feature set of each word according to the to-be-trained audio data, wherein the to-be-trained acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
training the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure, and the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase;
and the bus system is configured to connect the memory and the processor so that the memory and the processor communicate with each other.
14. An intelligent speech interaction system, comprising a voice collection module, a speech processing and analysis module, and a storage module;
wherein the voice collection module is configured to obtain to-be-labeled text data and audio data, wherein there is a correspondence between the to-be-labeled text data and the audio data, the to-be-labeled text data includes at least one word, and each word corresponds to one word identifier;
the speech processing and analysis module is configured to: extract a to-be-labeled text feature set of each word according to the to-be-labeled text data, wherein the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type, and the audio data is voice data;
extract an acoustic feature set of each word according to the audio data, wherein the acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value; and
according to the word identifier of each word, the to-be-labeled text feature set of each word, and the acoustic feature set of each word, obtain a prosody hierarchy structure through a prosody hierarchy labeling model, wherein the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase;
and the storage module is configured to store the prosody hierarchy structure.
15. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the method according to claim 1, or to perform the method according to any one of claims 2 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910751371.6A CN110444191B (en) | 2019-01-22 | 2019-01-22 | Rhythm level labeling method, model training method and device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910751371.6A CN110444191B (en) | 2019-01-22 | 2019-01-22 | Rhythm level labeling method, model training method and device |
CN201910060152.3A CN109697973A (en) | 2019-01-22 | 2019-01-22 | A kind of method, the method and device of model training of prosody hierarchy mark |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910060152.3A Division CN109697973A (en) | 2019-01-22 | 2019-01-22 | A kind of method, the method and device of model training of prosody hierarchy mark |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110444191A true CN110444191A (en) | 2019-11-12 |
CN110444191B CN110444191B (en) | 2021-11-26 |
Family
ID=66234262
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910751371.6A Active CN110444191B (en) | 2019-01-22 | 2019-01-22 | Rhythm level labeling method, model training method and device |
CN201910060152.3A Pending CN109697973A (en) | 2019-01-22 | 2019-01-22 | A kind of method, the method and device of model training of prosody hierarchy mark |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910060152.3A Pending CN109697973A (en) | 2019-01-22 | 2019-01-22 | A kind of method, the method and device of model training of prosody hierarchy mark |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110444191B (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111128120A (en) * | 2019-12-31 | 2020-05-08 | 苏州思必驰信息科技有限公司 | Text-to-speech method and device |
CN111164674A (en) * | 2019-12-31 | 2020-05-15 | 深圳市优必选科技股份有限公司 | Speech synthesis method, device, terminal and storage medium |
CN111261162A (en) * | 2020-03-09 | 2020-06-09 | 北京达佳互联信息技术有限公司 | Speech recognition method, speech recognition apparatus, and storage medium |
CN111710326A (en) * | 2020-06-12 | 2020-09-25 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111754978A (en) * | 2020-06-15 | 2020-10-09 | 北京百度网讯科技有限公司 | Rhythm hierarchy marking method, device, equipment and storage medium |
CN111785247A (en) * | 2020-07-13 | 2020-10-16 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112382270A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN112466277A (en) * | 2020-10-28 | 2021-03-09 | 北京百度网讯科技有限公司 | Rhythm model training method and device, electronic equipment and storage medium |
CN112863476A (en) * | 2019-11-27 | 2021-05-28 | 阿里巴巴集团控股有限公司 | Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing |
CN113129863A (en) * | 2019-12-31 | 2021-07-16 | 科大讯飞股份有限公司 | Voice time length prediction method, device, equipment and readable storage medium |
CN113421550A (en) * | 2021-06-25 | 2021-09-21 | 北京有竹居网络技术有限公司 | Speech synthesis method, device, readable medium and electronic equipment |
CN113421543A (en) * | 2021-06-30 | 2021-09-21 | 深圳追一科技有限公司 | Data labeling method, device and equipment and readable storage medium |
CN114420089A (en) * | 2022-03-30 | 2022-04-29 | 北京世纪好未来教育科技有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN115116428A (en) * | 2022-05-19 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Prosodic boundary labeling method, apparatus, device, medium, and program product |
CN115188365A (en) * | 2022-09-09 | 2022-10-14 | 中邮消费金融有限公司 | Pause prediction method and device, electronic equipment and storage medium |
CN113129863B (en) * | 2019-12-31 | 2024-05-31 | 科大讯飞股份有限公司 | Voice duration prediction method, device, equipment and readable storage medium |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020218635A1 (en) * | 2019-04-23 | 2020-10-29 | 엘지전자 주식회사 | Voice synthesis apparatus using artificial intelligence, method for operating voice synthesis apparatus, and computer-readable recording medium |
CN110164413B (en) * | 2019-05-13 | 2021-06-04 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, computer device and storage medium |
CN110619035B (en) * | 2019-08-01 | 2023-07-25 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for identifying keywords in interview video |
CN112528014B (en) * | 2019-08-30 | 2023-04-18 | 成都启英泰伦科技有限公司 | Method and device for predicting word segmentation, part of speech and rhythm of language text |
CN110556093B (en) * | 2019-09-17 | 2021-12-10 | 浙江同花顺智富软件有限公司 | Voice marking method and system |
CN110459202B (en) * | 2019-09-23 | 2022-03-15 | 浙江同花顺智能科技有限公司 | Rhythm labeling method, device, equipment and medium |
CN110675896B (en) * | 2019-09-30 | 2021-10-22 | 北京字节跳动网络技术有限公司 | Character time alignment method, device and medium for audio and electronic equipment |
CN110797005B (en) * | 2019-11-05 | 2022-06-10 | 百度在线网络技术(北京)有限公司 | Prosody prediction method, apparatus, device, and medium |
CN110767213A (en) * | 2019-11-08 | 2020-02-07 | 四川长虹电器股份有限公司 | Rhythm prediction method and device |
WO2021134581A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Prosodic feature prediction-based speech synthesis method, apparatus, terminal, and medium |
CN111369971B (en) * | 2020-03-11 | 2023-08-04 | 北京字节跳动网络技术有限公司 | Speech synthesis method, device, storage medium and electronic equipment |
CN111681641B (en) * | 2020-05-26 | 2024-02-06 | 微软技术许可有限责任公司 | Phrase-based end-to-end text-to-speech (TTS) synthesis |
CN111667816B (en) * | 2020-06-15 | 2024-01-23 | 北京百度网讯科技有限公司 | Model training method, speech synthesis method, device, equipment and storage medium |
CN114064964A (en) * | 2020-07-30 | 2022-02-18 | 华为技术有限公司 | Text time labeling method and device, electronic equipment and readable storage medium |
CN112102847B (en) * | 2020-09-09 | 2022-08-09 | 四川大学 | Audio and slide content alignment method |
CN112216267A (en) * | 2020-09-15 | 2021-01-12 | 北京捷通华声科技股份有限公司 | Rhythm prediction method, device, equipment and storage medium |
CN112863484B (en) * | 2021-01-25 | 2024-04-09 | 中国科学技术大学 | Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method |
CN113178188B (en) * | 2021-04-26 | 2024-05-28 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and storage medium |
CN113327615B (en) * | 2021-08-02 | 2021-11-16 | 北京世纪好未来教育科技有限公司 | Voice evaluation method, device, equipment and storage medium |
CN115116427B (en) * | 2022-06-22 | 2023-11-14 | 马上消费金融股份有限公司 | Labeling method, voice synthesis method, training method and training device |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070112570A1 (en) * | 2005-11-17 | 2007-05-17 | Oki Electric Industry Co., Ltd. | Voice synthesizer, voice synthesizing method, and computer program |
US8554566B2 (en) * | 2008-08-12 | 2013-10-08 | Morphism Llc | Training and applying prosody models |
TW201432668A (en) * | 2013-02-05 | 2014-08-16 | Univ Nat Chiao Tung | Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech-synthesizing |
CN105374350A (en) * | 2015-09-29 | 2016-03-02 | 百度在线网络技术(北京)有限公司 | Speech marking method and device |
CN105551481A (en) * | 2015-12-21 | 2016-05-04 | 百度在线网络技术(北京)有限公司 | Rhythm marking method of voice data and apparatus thereof |
US20160365087A1 (en) * | 2015-06-12 | 2016-12-15 | Geulah Holdings Llc | High end speech synthesis |
CN106601228A (en) * | 2016-12-09 | 2017-04-26 | 百度在线网络技术(北京)有限公司 | Sample marking method and device based on artificial intelligence prosody prediction |
CN106971709A (en) * | 2017-04-19 | 2017-07-21 | 腾讯科技(上海)有限公司 | Statistical parameter model establishing method and device, and speech synthesis method and device |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7136816B1 (en) * | 2002-04-05 | 2006-11-14 | At&T Corp. | System and method for predicting prosodic parameters |
US20070055526A1 (en) * | 2005-08-25 | 2007-03-08 | International Business Machines Corporation | Method, apparatus and computer program product providing prosodic-categorical enhancement to phrase-spliced text-to-speech synthesis |
CN103035241A (en) * | 2012-12-07 | 2013-04-10 | 中国科学院自动化研究所 | Model-complementary Chinese prosodic break recognition system and method |
CN105185373B (en) * | 2015-08-06 | 2017-04-05 | 百度在线网络技术(北京)有限公司 | Prosody hierarchy prediction model generation method, and prosody hierarchy prediction method and device |
CN105244020B (en) * | 2015-09-24 | 2017-03-22 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device |
CN105185372B (en) * | 2015-10-20 | 2017-03-22 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device |
CN107039034B (en) * | 2016-02-04 | 2020-05-01 | 科大讯飞股份有限公司 | Rhythm prediction method and system |
CN108305612B (en) * | 2017-11-21 | 2020-07-31 | 腾讯科技(深圳)有限公司 | Text processing method, text processing device, model training method, model training device, storage medium and computer equipment |
2019
- 2019-01-22 CN CN201910751371.6A patent/CN110444191B/en active Active
- 2019-01-22 CN CN201910060152.3A patent/CN109697973A/en active Pending
Non-Patent Citations (4)
Title |
---|
CHONGJIA NI ET AL: "Automatic Prosody Boundary Labeling of Mandarin Using Both Text and Acoustic Information", 2008 6th International Symposium on Chinese Spoken Language Processing * |
MU WANG ET AL: "Emphatic Speech Synthesis and Control Based on Characteristic Transferring in End-to-End Speech Synthesis", 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia) * |
LI XIAO: "Chinese Prosodic Phrase Boundary Prediction Based on Text and Speech Features", China Master's Theses Full-text Database, Information Science and Technology Series * |
SU DAN: "Research on Chinese Prosodic Phrase Prediction Based on Semi-supervised Learning", China Master's Theses Full-text Database, Information Science and Technology Series * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112863476A (en) * | 2019-11-27 | 2021-05-28 | 阿里巴巴集团控股有限公司 | Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing |
CN113129863A (en) * | 2019-12-31 | 2021-07-16 | 科大讯飞股份有限公司 | Voice time length prediction method, device, equipment and readable storage medium |
CN111164674B (en) * | 2019-12-31 | 2024-05-03 | 深圳市优必选科技股份有限公司 | Speech synthesis method, device, terminal and storage medium |
CN113129863B (en) * | 2019-12-31 | 2024-05-31 | 科大讯飞股份有限公司 | Voice duration prediction method, device, equipment and readable storage medium |
CN111164674A (en) * | 2019-12-31 | 2020-05-15 | 深圳市优必选科技股份有限公司 | Speech synthesis method, device, terminal and storage medium |
CN111128120A (en) * | 2019-12-31 | 2020-05-08 | 苏州思必驰信息科技有限公司 | Text-to-speech method and device |
CN111261162B (en) * | 2020-03-09 | 2023-04-18 | 北京达佳互联信息技术有限公司 | Speech recognition method, speech recognition apparatus, and storage medium |
CN111261162A (en) * | 2020-03-09 | 2020-06-09 | 北京达佳互联信息技术有限公司 | Speech recognition method, speech recognition apparatus, and storage medium |
CN111710326A (en) * | 2020-06-12 | 2020-09-25 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111710326B (en) * | 2020-06-12 | 2024-01-23 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111754978A (en) * | 2020-06-15 | 2020-10-09 | 北京百度网讯科技有限公司 | Rhythm hierarchy marking method, device, equipment and storage medium |
CN111754978B (en) * | 2020-06-15 | 2023-04-18 | 北京百度网讯科技有限公司 | Prosodic hierarchy labeling method, device, equipment and storage medium |
CN111785247A (en) * | 2020-07-13 | 2020-10-16 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112466277A (en) * | 2020-10-28 | 2021-03-09 | 北京百度网讯科技有限公司 | Rhythm model training method and device, electronic equipment and storage medium |
CN112466277B (en) * | 2020-10-28 | 2023-10-20 | 北京百度网讯科技有限公司 | Prosody model training method and device, electronic equipment and storage medium |
CN112382270A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN113421550A (en) * | 2021-06-25 | 2021-09-21 | 北京有竹居网络技术有限公司 | Speech synthesis method, device, readable medium and electronic equipment |
CN113421543A (en) * | 2021-06-30 | 2021-09-21 | 深圳追一科技有限公司 | Data labeling method, device and equipment and readable storage medium |
CN113421543B (en) * | 2021-06-30 | 2024-05-24 | 深圳追一科技有限公司 | Data labeling method, device, equipment and readable storage medium |
CN114420089B (en) * | 2022-03-30 | 2022-06-21 | 北京世纪好未来教育科技有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN114420089A (en) * | 2022-03-30 | 2022-04-29 | 北京世纪好未来教育科技有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN115116428A (en) * | 2022-05-19 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Prosodic boundary labeling method, apparatus, device, medium, and program product |
CN115116428B (en) * | 2022-05-19 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Prosodic boundary labeling method, device, equipment, medium and program product |
CN115188365A (en) * | 2022-09-09 | 2022-10-14 | 中邮消费金融有限公司 | Pause prediction method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109697973A (en) | 2019-04-30 |
CN110444191B (en) | 2021-11-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110444191A (en) | A kind of method, the method and device of model training of prosody hierarchy mark | |
CN110838286B (en) | Model training method, language identification method, device and equipment | |
CN110491382B (en) | Speech recognition method and device based on artificial intelligence and speech interaction equipment | |
CN110490213B (en) | Image recognition method, device and storage medium | |
WO2018153213A1 (en) | Multi-language hybrid speech recognition method | |
CN110288077A (en) | Method and related apparatus for synthesizing speaking expressions based on artificial intelligence | |
CN110853617B (en) | Model training method, language identification method, device and equipment | |
WO2021135611A1 (en) | Method and device for speech recognition, terminal and storage medium | |
CN109145303A (en) | Name entity recognition method, device, medium and equipment | |
CN107943860A (en) | Model training method, and text intent recognition method and device |
CN110853618A (en) | Language identification method, model training method, device and equipment | |
CN110444210A (en) | Speech recognition method, and wake-up word detection method and device |
CN109523989A (en) | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN110890093A (en) | Intelligent device awakening method and device based on artificial intelligence | |
CN110517664A (en) | Multi-party speech recognition method, apparatus, device and readable storage medium |
US11705105B2 (en) | Speech synthesizer for evaluating quality of synthesized speech using artificial intelligence and method of operating the same | |
CN110334197A (en) | Corpus processing method and relevant apparatus | |
WO2020098269A1 (en) | Speech synthesis method and speech synthesis device | |
CN106649253B (en) | Auxiliary control method and system based on post-verification |
KR101627428B1 (en) | Method for establishing syntactic analysis model using deep learning and apparatus for perforing the method | |
US11568853B2 (en) | Voice recognition method using artificial intelligence and apparatus thereof | |
CN112735418B (en) | Voice interaction processing method, device, terminal and storage medium | |
CN111597804B (en) | Method and related device for training entity recognition model | |
CN113393828A (en) | Training method of voice synthesis model, and voice synthesis method and device | |
CN108345612A (en) | Question processing method and device, and device for question processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||