CN110444191A - Method for prosody hierarchy annotation, and method and device for model training - Google Patents
Method for prosody hierarchy annotation, and method and device for model training - Download PDF
- Publication number
- CN110444191A (application number CN201910751371.6A)
- Authority
- CN
- China
- Prior art keywords
- word
- trained
- text
- prosody hierarchy
- audio data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/02—Methods for producing synthetic speech; Speech synthesisers
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Abstract
This application discloses a method for prosody hierarchy annotation. The method is applied to the field of artificial intelligence, and in particular to speech synthesis. The method includes: obtaining text data to be annotated and audio data, where there is a correspondence between the text data to be annotated and the audio data; extracting a text feature set of each word from the text data to be annotated; extracting an acoustic feature set of each word from the audio data; and obtaining a prosody hierarchy structure through a prosody hierarchy annotation model according to the word identifier, the text feature set, and the acoustic feature set of each word. Also disclosed are a method for model training, a prosody hierarchy annotation apparatus, and a model training apparatus. By jointly using text features and acoustic features to build the prosody hierarchy annotation model, this application provides richer features for prosody hierarchy annotation, improves the accuracy of prosody hierarchy annotation, and enhances the quality of speech synthesis.
Description
This application is a divisional application of Chinese patent application No. 201910060152.3, filed with the China National Intellectual Property Office on January 22, 2019 and entitled "Method for prosody hierarchy annotation, and method and device for model training".
Technical field
This application relates to the field of intelligent speech synthesis, and in particular to a method for prosody hierarchy annotation, a method for model training, and a related apparatus.
Background technique
To build a high-quality speech synthesis system, a massive amount of data accurately annotated with prosody hierarchy structure is crucial. The prosody hierarchy structure models the rhythm of speech and its pauses, so a method capable of accurately and automatically annotating the prosody hierarchy structure is of great importance for rapidly building a speech synthesis corpus and improving the naturalness of synthesized speech.
At present, automatic annotation of prosody hierarchy structure requires training an automatic annotation model with machine learning methods, and there are two main approaches to feature selection. The first uses text features: the text is first segmented into words, the text features of each word are extracted, and the prosody hierarchy type of each word is decided by a machine learning method. The second uses acoustic features: the pause positions in the audio are detected, and different prosody hierarchy types are distinguished by the length of the pauses.
However, annotating with text data alone ignores phenomena that occur in practice, such as the lengthening of the syllable preceding a prosody hierarchy boundary and the short pauses that often occur at intonation phrase boundaries. Annotating with acoustic features alone makes it difficult to accurately label all three levels of the prosody hierarchy at once, and ignores the intrinsic connection between text features and acoustic features. Both approaches reduce the quality of prosody hierarchy annotation and thus the quality of the corpus on which speech synthesis relies.
Summary of the invention
The embodiments of this application provide a method for prosody hierarchy annotation, a method for model training, and a device. By jointly using text features and acoustic features to build a prosody hierarchy annotation model, richer features can be provided for prosody hierarchy annotation; a more accurate annotation model improves the accuracy of prosody hierarchy annotation and enhances the quality of speech synthesis.
In view of this, a first aspect of this application provides a method for prosody hierarchy annotation, including:
obtaining text data to be annotated and audio data, where there is a correspondence between the text data to be annotated and the audio data, the text data to be annotated includes at least one word, and each word corresponds to a word identifier;
extracting a text feature set of each word from the text data to be annotated, where the text feature set includes part of speech, word length, and post-word punctuation type;
extracting an acoustic feature set of each word from the audio data, where the acoustic feature set includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
obtaining a prosody hierarchy structure through a prosody hierarchy annotation model according to the word identifier, the text feature set, and the acoustic feature set of each word.
A second aspect of this application provides a method for model training, including:
obtaining text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word from the text data to be trained, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type;
extracting an acoustic feature set to be trained of each word from the audio data to be trained, where the acoustic feature set to be trained includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
training on the word identifier, the text feature set to be trained, and the acoustic feature set to be trained of each word, to obtain a prosody hierarchy annotation model, where the prosody hierarchy annotation model is used to annotate the prosody hierarchy structure.
A third aspect of this application provides a prosody hierarchy annotation apparatus, including:
an obtaining module, configured to obtain text data to be annotated and audio data, where there is a correspondence between the text data to be annotated and the audio data, the text data to be annotated includes at least one word, and each word corresponds to a word identifier;
an extraction module, configured to extract a text feature set of each word from the text data to be annotated obtained by the obtaining module, where the text feature set includes part of speech, word length, and post-word punctuation type; the extraction module is further configured to extract an acoustic feature set of each word from the audio data obtained by the obtaining module, where the acoustic feature set includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
a prediction module, configured to obtain a prosody hierarchy structure through a prosody hierarchy annotation model according to the word identifier of each word and the text feature set and acoustic feature set of each word extracted by the extraction module.
In a possible design, in a first implementation of the third aspect of the embodiments of this application, the prediction module is specifically configured to determine at least one of a prosodic word, a prosodic phrase, and an intonation phrase through the prosody hierarchy annotation model; or to determine a prosodic word and/or a prosodic phrase through the prosody hierarchy annotation model.
A fourth aspect of this application provides a model training apparatus, including:
an obtaining module, configured to obtain text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
an extraction module, configured to extract a text feature set to be trained of each word from the text data to be trained obtained by the obtaining module, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type; the extraction module is further configured to extract an acoustic feature set to be trained of each word from the audio data to be trained obtained by the obtaining module, where the acoustic feature set to be trained includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
a training module, configured to train on the word identifier, the text feature set to be trained, and the acoustic feature set to be trained of each word extracted by the extraction module, to obtain the prosody hierarchy annotation model, where the prosody hierarchy annotation model is used to annotate the prosody hierarchy structure.
In a possible design, in a first implementation of the fourth aspect of the embodiments of this application, the model training apparatus further includes a processing module and a generation module;
the processing module is configured to perform word segmentation on the text data to be trained after the obtaining module obtains the text data to be trained and the audio data to be trained, to obtain at least one word;
the obtaining module is further configured to obtain, according to a preset word identifier relationship, a target word identifier corresponding to a target word, where the preset word identifier relationship indicates the relationship between each preset word and its word identifier, and the target word is any one of the at least one word obtained by the processing module;
the generation module is configured to generate a target word vector corresponding to the target word in the text data to be trained; and
the training module is specifically configured to train on the target word identifier obtained by the obtaining module and the target word vector generated by the generation module, to obtain a first model parameter, where the first model parameter is used to generate the word embedding layer in the prosody hierarchy annotation model.
In a possible design, in a second implementation of the fourth aspect of the embodiments of this application,
the extraction module is specifically configured to obtain the part of speech, word length, and post-word punctuation type of the target word in the text data to be trained, where the part of speech indicates the syntactic category of the word, the word length indicates the number of characters in the word, and the post-word punctuation type indicates the type of punctuation that follows the word; and to obtain the part of speech, word length, and post-word punctuation type of a related word in the text data to be trained, where the related word is a word having an association with the target word; and
the training module is specifically configured to train on the part of speech, word length, and post-word punctuation type of the target word and of the related word, to obtain a second model parameter, where the second model parameter is used to generate the text neural network in the prosody hierarchy annotation model.
In a possible design, in a third implementation of the fourth aspect of the embodiments of this application, the model training apparatus further includes an alignment module;
the alignment module is configured to perform forced alignment on the text data to be trained and the audio data to be trained after the obtaining module obtains them, to obtain time-aligned text; and
the extraction module is specifically configured to determine the final-syllable duration of the target word according to the time-aligned text.
In a possible design, in a fourth implementation of the fourth aspect of the embodiments of this application, the extraction module is specifically configured to determine the post-word pause duration of the target word according to the time-aligned text.
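Given time-aligned text from forced alignment, the two duration features above reduce to simple interval arithmetic. The following is a hedged sketch: the `(start, end)` interval representation in seconds is an assumption, since the patent does not specify the alignment output format.

```python
def final_syllable_duration(syllable_intervals):
    """Final-syllable duration of one word.

    syllable_intervals -- list of (start, end) times, in seconds, for the
                          word's syllables as produced by forced alignment.
    """
    start, end = syllable_intervals[-1]  # last syllable of the word
    return end - start

def pause_after(word_end, next_word_start):
    """Post-word pause duration: silence between a word's end time and the
    next word's start time; 0.0 when the words are contiguous."""
    return max(0.0, next_word_start - word_end)
```

With a real aligner one would read these times from its output (e.g. a TextGrid-style interval tier) rather than construct them by hand.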
In a possible design, in a fifth implementation of the fourth aspect of the embodiments of this application,
the extraction module is specifically configured to: compute the frame numbers of the voiced start frame and the voiced end frame of the final syllable of the target word according to the time-aligned text and the fundamental frequency (F0) information extracted from the audio data to be trained; extract the log-F0 curve and the log-energy curve of the audio data to be trained; and compute the final-syllable acoustic statistics of the target word according to the frame numbers of the voiced start frame and voiced end frame, the log-F0 curve, and the log-energy curve, where the final-syllable acoustic statistics include at least one of the maximum, minimum, range, mean, and variance of the log-F0 curve, and further include at least one of the maximum, minimum, range, mean, and variance of the log-energy curve.
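The statistics listed above are standard descriptive measures over the voiced span of the final syllable. A minimal sketch, assuming per-frame curves stored as plain Python lists and inclusive frame indices (both assumptions, not stated in the patent):

```python
def span_statistics(curve, voiced_start, voiced_end):
    """Max, min, range, mean and (population) variance of a per-frame curve
    (e.g. log-F0 or log-energy) over frames voiced_start..voiced_end inclusive.
    """
    seg = curve[voiced_start:voiced_end + 1]
    mean = sum(seg) / len(seg)
    var = sum((x - mean) ** 2 for x in seg) / len(seg)
    return {
        "max": max(seg),
        "min": min(seg),
        "range": max(seg) - min(seg),
        "mean": mean,
        "var": var,
    }
```

The same function would be applied once to the log-F0 curve and once to the log-energy curve, yielding up to ten scalar features per final syllable.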
In a possible design, in a sixth implementation of the fourth aspect of the embodiments of this application,
the extraction module is specifically configured to: compute, according to the time-aligned text and the F0 information extracted from the audio data to be trained, the frame number of the last voiced frame of the target word and the frame number of the first voiced frame of the next adjacent word; determine, according to these frame numbers and the per-frame F0 and energy information extracted from the audio data to be trained, the F0 values and energy values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word; and compute the log difference of the F0 values and the log difference of the energy values according to these two frames, where the log difference of the F0 values and the log difference of the energy values belong to the inter-word acoustic feature change values.
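The inter-word change feature just described can be sketched as below. Interpreting "log difference" as `log(a) - log(b)` across the word boundary is an assumption; the patent does not give the exact formula, and the frame indices would come from the voiced-frame detection described above.

```python
import math

def interword_log_deltas(f0, energy, last_voiced, next_first_voiced):
    """Log differences of F0 and energy across a word boundary.

    f0, energy        -- per-frame values extracted from the audio
    last_voiced       -- index of the target word's final voiced frame
    next_first_voiced -- index of the next word's first voiced frame
    """
    d_f0 = math.log(f0[next_first_voiced]) - math.log(f0[last_voiced])
    d_en = math.log(energy[next_first_voiced]) - math.log(energy[last_voiced])
    return d_f0, d_en
```

A large negative `d_f0` or `d_en` across the boundary would suggest a pitch/energy reset, which is one cue the annotation model can exploit for boundary type.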
In a possible design, in a seventh implementation of the fourth aspect of the embodiments of this application,
the training module is specifically configured to: obtain a first output result of the target word identifier through the word embedding layer in the prosody hierarchy annotation model, where the target word identifier corresponds to the target word, the target word is any one of the at least one word, and the word embedding layer is obtained by training according to the first model parameter; obtain a second output result of the target text feature set to be trained through the text neural network in the prosody hierarchy annotation model, where the target text feature set to be trained corresponds to the target word, and the text neural network is obtained by training according to the second model parameter; train on the first output result, the second output result, and the target acoustic feature set to be trained, to obtain a third model parameter, where the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used to generate the acoustic neural network in the prosody hierarchy annotation model; and generate the prosody hierarchy annotation model according to the first model parameter, the second model parameter, and the third model parameter.
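Structurally, the model described above combines three trained components: a word embedding layer, a text neural network, and an acoustic neural network. The following class is a minimal structural sketch only (not the patent's actual network or training procedure); the component callables and the concatenation-then-classify combination are illustrative assumptions.

```python
class ProsodyAnnotator:
    """Three-part prosody model: embedding + text net + acoustic net."""

    def __init__(self, embeddings, text_net, acoustic_net, classifier):
        self.embeddings = embeddings      # word id -> vector (1st model params)
        self.text_net = text_net          # text features -> vector (2nd params)
        self.acoustic_net = acoustic_net  # acoustic features -> vector (3rd params)
        self.classifier = classifier      # combined vector -> prosody label

    def predict(self, word_id, text_feats, acoustic_feats):
        # Concatenate the three component outputs (lists here, so `+` joins
        # them) and let the classifier decide the prosody hierarchy label.
        combined = (self.embeddings[word_id]
                    + self.text_net(text_feats)
                    + self.acoustic_net(acoustic_feats))
        return self.classifier(combined)
```

In practice each component would be a trained network (e.g. the text and acoustic networks producing hidden-state vectors) and the classifier a softmax layer over boundary types; here simple callables stand in for them.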
A fifth aspect of this application provides a prosody hierarchy annotation apparatus, including a memory, a transceiver, a processor, and a bus system, where the memory is configured to store a program, and the processor is configured to execute the program in the memory to perform the following steps:
obtaining text data to be annotated and audio data, where there is a correspondence between the text data to be annotated and the audio data, the text data to be annotated includes at least one word, and each word corresponds to a word identifier;
extracting a text feature set of each word from the text data to be annotated, where the text feature set includes part of speech, word length, and post-word punctuation type;
extracting an acoustic feature set of each word from the audio data, where the acoustic feature set includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
obtaining a prosody hierarchy structure through a prosody hierarchy annotation model according to the word identifier, the text feature set, and the acoustic feature set of each word;
the bus system is configured to connect the memory and the processor so that the memory and the processor communicate.
A sixth aspect of this application provides a model training apparatus, including a memory, a transceiver, a processor, and a bus system, where the memory is configured to store a program, and the processor is configured to execute the program in the memory to perform the following steps:
obtaining text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
extracting a text feature set to be trained of each word from the text data to be trained, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type;
extracting an acoustic feature set to be trained of each word from the audio data to be trained, where the acoustic feature set to be trained includes final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values; and
training on the word identifier, the text feature set to be trained, and the acoustic feature set to be trained of each word, to obtain a prosody hierarchy annotation model, where the prosody hierarchy annotation model is used to annotate the prosody hierarchy structure;
the bus system is configured to connect the memory and the processor so that the memory and the processor communicate.
A seventh aspect of this application provides a computer-readable storage medium storing instructions that, when run on a computer, cause the computer to execute the methods described in the above aspects.
As can be seen from the above technical solutions, the embodiments of this application have the following advantages:
The embodiments of this application provide a method for prosody hierarchy annotation. First, text data to be annotated and audio data are obtained, where there is a correspondence between them, the text data to be annotated includes at least one word, and each word corresponds to a word identifier. A text feature set of each word is then extracted from the text data to be annotated, including part of speech, word length, and post-word punctuation type. Next, an acoustic feature set of each word is extracted from the audio data, including final-syllable duration, post-word pause duration, final-syllable acoustic statistics, and inter-word acoustic feature change values. Finally, a prosody hierarchy structure is obtained through a prosody hierarchy annotation model according to the word identifier, the text feature set, and the acoustic feature set of each word. In this way, text features and acoustic features are jointly used to build the prosody hierarchy annotation model, providing richer features for prosody hierarchy annotation; a more accurate annotation model improves the accuracy of prosody hierarchy annotation, which helps improve the naturalness of synthesized speech.
Detailed description of the invention
Fig. 1 is an architecture diagram of a speech synthesis system in the embodiments of this application;
Fig. 2 is a schematic structural diagram of the prosody hierarchy in the embodiments of this application;
Fig. 3 is a schematic diagram of an embodiment of the method for prosody hierarchy annotation in the embodiments of this application;
Fig. 4 is a schematic diagram of an application of the prosody hierarchy annotation system in the embodiments of this application;
Fig. 5 is a schematic flowchart of prosody hierarchy annotation in the embodiments of this application;
Fig. 6 is a schematic diagram of an embodiment of the method for model training in the embodiments of this application;
Fig. 7 is a schematic flowchart of acoustic feature set extraction in the embodiments of this application;
Fig. 8 is a schematic diagram of an embodiment of an F0 curve in the embodiments of this application;
Fig. 9 is a schematic diagram of an embodiment of an energy curve in the embodiments of this application;
Fig. 10 is a schematic structural diagram of the prosody hierarchy annotation model in the embodiments of this application;
Fig. 11 is a schematic diagram of an embodiment of the prosody hierarchy annotation apparatus in the embodiments of this application;
Fig. 12 is a schematic diagram of an embodiment of the model training apparatus in the embodiments of this application;
Fig. 13 is a schematic diagram of another embodiment of the model training apparatus in the embodiments of this application;
Fig. 14 is a schematic diagram of another embodiment of the model training apparatus in the embodiments of this application;
Fig. 15 is a schematic structural diagram of a terminal device in the embodiments of this application;
Fig. 16 is a schematic structural diagram of a server in the embodiments of this application.
Detailed description of embodiments
The embodiments of this application provide a method for prosody hierarchy annotation, a method for model training, and a device. By jointly using text features and acoustic features to build a prosody hierarchy annotation model, richer features can be provided for prosody hierarchy annotation; a more accurate annotation model improves the accuracy of prosody hierarchy annotation and enhances the quality of speech synthesis.
The terms "first", "second", "third", "fourth", and the like (if any) in the description, claims, and drawings of this application are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data used in this way are interchangeable where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "correspond to" and any variants of them are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or are inherent to the process, method, product, or device.
It should be understood that this application is mainly applied to the data preparation process of building a speech synthesis corpus, namely the automatic labeling of prosody hierarchy structure on text data. Speech synthesis is the task of converting text into speech, and building a high-quality speech synthesis system requires preparing massive amounts of data; among them, data labeled with prosody hierarchy structure has a great influence on the naturalness of synthesized speech. Traditional manual labeling is time-consuming and laborious for massive data, and different annotators can be inconsistent in how they label certain words. A system that automatically labels prosody hierarchy structure is therefore of great significance both for the massive prosody labeling task of quickly building a speech synthesis system and for resolving the inconsistency between different annotators.
The key technologies of speech technology (Speech Technology) are automatic speech recognition (ASR), text-to-speech synthesis (TTS) and voiceprint recognition. Enabling computers to listen, see, speak and feel is the development direction of future human-computer interaction, in which speech is expected to become one of the most promising modes of interaction.
For ease of understanding, this application proposes a method of prosody hierarchy labeling and a method of model training, applied to the speech synthesis system shown in Fig. 1. Referring to Fig. 1, which is an architecture diagram of the speech synthesis system in the embodiment of the present application, a terminal device or a server first obtains text data and audio data, where the text data and the audio data correspond to each other; for example, the text data is "today is an auspicious day" and the audio data is the audio of "today is an auspicious day". A forced alignment tool is used to align the text data with the audio data. Next, the text feature set corresponding to each word in the text data is extracted; the text feature set of each word includes the part of speech, the word length and the punctuation type after the word. At the same time, feature extraction is also performed on the audio data to obtain the acoustic feature set of each word, which includes the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable, and the inter-word acoustic feature change values, where the inter-word change values are the log difference in fundamental frequency, and the log difference in energy, between the last voiced frame of the current word and the first voiced frame of the next word. In addition, a term identification (ID) of each word can be extracted from the text data. The term identification of each word in the sentence, the text feature set of each word and the acoustic feature set of each word are input into the trained prosody hierarchy labeling model, which outputs the prosody hierarchy labeling result. If the prosody hierarchy labeling model is deployed on the terminal device, the terminal device can play the corresponding sentence directly according to the prosody hierarchy structure it obtains from the model. If the model is deployed on the server, the server obtains the prosody hierarchy structure from the model and then feeds it back to the terminal device, which plays the corresponding sentence according to the prosody hierarchy structure.
It should be noted that terminal devices include but are not limited to tablet computers, laptops, palmtop computers, mobile phones, voice interaction devices and personal computers (PC), without limitation here. Voice interaction devices include but are not limited to smart speakers and smart appliances. Voice interaction devices also have the following characteristics:
1. Networking: voice interaction devices can be connected together through a local area network, connected to the manufacturer's service site through a home gateway interface, and ultimately connected to the Internet to share information.
2. Intelligence: voice interaction devices can respond automatically according to changes in the surrounding environment, without human intervention.
3. Openness and compatibility: since a user's voice interaction devices may come from different manufacturers, they need to be open and compatible.
4. Energy saving: smart appliances can automatically adjust their working time and working state according to the surrounding environment, thereby saving energy.
5. Ease of use: since complicated control operations are handled by the controller embedded in the voice interaction device, the user only needs to understand very simple operations. A voice interaction device does not refer to a single device but to a technical system; as human application demands and device intelligence continue to develop, its content will become richer, and the functions of voice interaction devices will differ according to the actual application environment, but they generally feature intelligent control technology.
It should be understood that the prosody hierarchy structure output by the speech synthesis system can specifically be the prosody hierarchy of Chinese. As a tonal language, Chinese has extremely complex prosodic features. The prosody hierarchy structure models prosodic features of speech such as pauses and rhythm, and is of great significance to the naturalness of the sound synthesized by a speech synthesis system. A typical division of the prosody hierarchy structure is shown in Fig. 2. Referring to Fig. 2, which is a structural schematic diagram of the prosody hierarchy in the embodiment of the present application, from the bottom up it is divided into prosodic word (PW), prosodic phrase (PPH) and intonational phrase (IPH). For example, in the sentence "showing best regards and hearty blessing", the PWs are "showing", "sincere", "greeting", "and hearty", "blessing"; the PPHs are "showing", "best regards", "and hearty blessing"; and the IPHs are "showing best regards" and "and hearty blessing".
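The three-level division in this example can be sketched as nested groupings of the same word sequence. The following is a minimal illustrative sketch (not part of the patent); the English glosses stand in for the original Chinese words:

```python
# A sketch of the three-level prosody hierarchy for the example sentence;
# English glosses stand in for the Chinese words.
sentence = ["showing", "sincere", "greeting", "and hearty", "blessing"]

# Each level is a coarser grouping of the level below: PW -> PPH -> IPH.
hierarchy = {
    "PW":  [["showing"], ["sincere"], ["greeting"], ["and hearty"], ["blessing"]],
    "PPH": [["showing"], ["sincere", "greeting"], ["and hearty", "blessing"]],
    "IPH": [["showing", "sincere", "greeting"], ["and hearty", "blessing"]],
}

def flatten(groups):
    """Concatenate a level's groups back into the underlying word sequence."""
    return [w for g in groups for w in g]

# Every level is a segmentation of the same word sequence.
for groups in hierarchy.values():
    assert flatten(groups) == sentence
```

This makes explicit that labeling the prosody hierarchy amounts to deciding, at each word boundary, which level of boundary (if any) occurs there.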
With reference to the above introduction, the method of prosody hierarchy labeling in this application is introduced below. Referring to Fig. 3, one embodiment of the method of prosody hierarchy labeling in the embodiment of the present application includes:
101. Obtain text data to be labeled and audio data, where there is a correspondence between the text data to be labeled and the audio data, the text data to be labeled includes at least one word, and each word corresponds to a term identification;
In the present embodiment, the text data to be labeled and the corresponding audio data are obtained first. The text data to be labeled can specifically be a sentence or a paragraph, and its language form includes but is not limited to Chinese, Japanese, English or Korean. The audio data can specifically be an audio file. The text data to be labeled includes at least one word and can therefore be segmented; for example, "showing best regards and hearty blessing" can be divided into the following five words: "showing", "sincere", "greeting", "and hearty", "blessing", and different words correspond to different term identifications.
102. Extract the text feature set to be labeled of each word according to the text data to be labeled, where the text feature set to be labeled includes the part of speech, the word length and the punctuation type after the word;
In the present embodiment, feature extraction is then performed on each word. Feature extraction covers two aspects: the first is the extraction of text features and the second is the extraction of acoustic features. When extracting text features, the text features of each word in the text data to be labeled need to be extracted. Taking the text data to be labeled "showing best regards and hearty blessing" as an example, the text feature set to be labeled corresponding to each word can be extracted; the text feature set here includes but is not limited to the part of speech, the word length and the punctuation type after the word.
Parts of speech are generally divided into notional words and function words. A notional word is a word with substantive meaning that can serve as a sentence element on its own, that is, a word with both lexical and grammatical meaning; taking grammatical function as the main criterion, a word that can independently serve as a syntactic constituent, with lexical and grammatical meaning, is a notional word. Notional words include nouns, verbs, adjectives, numerals, quantifiers and pronouns. A function word is a word without complete substantive meaning but with grammatical meaning or function; it must attach to a notional word or sentence to express grammatical meaning, cannot form a sentence or serve as a grammatical item on its own, and cannot be reduplicated. Function words include adverbs, prepositions, conjunctions, auxiliary words, interjections and onomatopoeia.
The word length indicates the length of the word; for example, the word length of "greeting" is 2 and the word length of "and" is 1.
The punctuation type after the word indicates whether a punctuation mark immediately follows the word and, if so, the type of that punctuation mark. In spoken language, punctuation manifests as pause time; it also helps people express thoughts and feelings accurately and understand written language.
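The per-word text features described above can be sketched as follows. This is a minimal illustrative sketch; the pre-tagged input is an assumption — in practice a word segmenter and part-of-speech tagger would produce it:

```python
# A sketch of per-word text-feature extraction: part of speech, word
# length, and the punctuation type following the word. The pre-tagged
# input is an illustrative assumption; in practice a segmenter / POS
# tagger supplies it.
PUNCT_TYPES = {"。": "full_stop", "，": "comma", "、": "pause_mark", None: "none"}

def text_features(tagged_words):
    """tagged_words: list of (word, pos, following_punct_or_None)."""
    feats = []
    for word, pos, punct in tagged_words:
        feats.append({
            "pos": pos,                    # notional or function word class
            "word_length": len(word),      # number of characters in the word
            "punct_after": PUNCT_TYPES.get(punct, "other"),
        })
    return feats

# "问候" ("greeting") has word length 2, "和" ("and") has word length 1,
# matching the examples in the text above.
tagged = [("问候", "noun", "，"), ("和", "conjunction", None)]
print(text_features(tagged))
```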
103. Extract the acoustic feature set of each word according to the audio data, where the acoustic feature set includes the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable, and the inter-word acoustic feature change values;
In the present embodiment, when extracting acoustic features, the acoustic features of each word in the audio data need to be extracted. Taking the text data to be labeled "showing best regards and hearty blessing" as an example, five acoustic feature sets can be extracted; the acoustic feature set here includes but is not limited to the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable, and the inter-word acoustic feature change values.
The final syllable duration refers to the time span of the voiced part of the last syllable of the word. For example, the last character of "greeting" is pronounced "hou", where the unvoiced part is "h" and the voiced part is "ou"; the final syllable duration is the time span of the "ou" part. It is detected with a dedicated tool, which is not expanded upon here.
The pause duration after the word refers to the length of time between the end of this word and the start of the next word, such as the interval between "greeting" and "and".
The acoustic statistics of the final syllable generally include ten parameters. Five of them are parameters related to the fundamental frequency curve of the final syllable: the maximum, minimum, range, mean and variance of the log fundamental frequency curve. The other five are parameters related to the log energy curve of the final syllable: the minimum, maximum, range, mean and variance of the log energy curve.
The inter-word acoustic feature change values are the log fundamental frequency difference and the log energy difference between the voiced tail of one word and the first voiced part of the next word.
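The five log-F0 statistics and the inter-word change value described above can be sketched directly. This is an illustrative sketch with made-up per-frame values; real F0 values would come from a pitch tracker:

```python
import math

def log_f0_stats(f0_frames):
    """Five statistics over the log-F0 curve of a word's final voiced
    syllable: max, min, range, mean, variance. f0_frames are per-frame
    fundamental-frequency values in Hz (voiced frames only); the same
    five statistics are computed over the log-energy curve."""
    logs = [math.log(f) for f in f0_frames]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return {"max": max(logs), "min": min(logs),
            "range": max(logs) - min(logs), "mean": mean, "variance": var}

def inter_word_change(last_voiced_f0, next_first_voiced_f0):
    """Log-F0 difference between the last voiced frame of the current
    word and the first voiced frame of the next word (the energy change
    value is computed the same way on log energy)."""
    return math.log(next_first_voiced_f0) - math.log(last_voiced_f0)

print(log_f0_stats([100.0, 200.0]))
print(inter_word_change(100.0, 200.0))
```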
104. Obtain the prosody hierarchy structure through the prosody hierarchy labeling model according to the term identification of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
In the present embodiment, the text data to be labeled and the audio data are input into the prosody hierarchy labeling model, and the model outputs the corresponding prosody hierarchy structure according to the term identification of each word, the text feature set to be labeled of each word and the acoustic feature set of each word.
For ease of introduction, referring to Fig. 4, which is an application demonstration diagram of the prosody hierarchy labeling system in the embodiment of the present application, the user provides the text data and audio data whose prosody hierarchy structure needs to be labeled; for example, the user inputs the text data to be labeled "showing best regards and hearty blessing" and supplies it, together with the corresponding audio data, to the prosody hierarchy labeling model. The model extracts the features, that is, the text feature set to be labeled of each word and the acoustic feature set of each word, then obtains the prosody hierarchy structure through a forward pass of the deep neural network, and provides the text labeled with the prosody hierarchy structure to the user.
Referring to Fig. 5, which is a flow diagram of prosody hierarchy labeling in the embodiment of the present application, specifically, in step S1, the text data and audio data of the sentence to be labeled are obtained first. In step S2, word segmentation is performed on the text data, and the text data is force-aligned with the audio data. In step S3, after the text data and audio data have been force-aligned, the corresponding text features and acoustic features can be extracted. In step S4, the extracted text features and acoustic features are input into the prosody hierarchy labeling model, which includes a feedforward neural network and a bidirectional long short-term memory network. In step S5, the prosody hierarchy labeling model outputs the prosody hierarchy structure of the sentence.
In the embodiment of the present application, a method of prosody hierarchy labeling is provided. First, text data to be labeled and audio data are obtained, where there is a correspondence between them, the text data to be labeled includes at least one word, and each word corresponds to a term identification. The text feature set to be labeled of each word is then extracted according to the text data to be labeled, where it includes the part of speech, the word length and the punctuation type after the word. The acoustic feature set of each word is extracted according to the audio data, where it includes the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable and the inter-word acoustic feature change values. Finally, the prosody hierarchy structure is obtained through the prosody hierarchy labeling model according to the term identification, the text feature set to be labeled and the acoustic feature set of each word. In this way, the prosody hierarchy labeling model is established jointly from text features and acoustic features, which provides richer features for prosody hierarchy labeling; a more accurate prosody hierarchy labeling model improves the accuracy of prosody hierarchy labeling and the effect of speech synthesis.
Optionally, on the basis of the embodiment corresponding to Fig. 3 above, in a first alternative embodiment of the method of prosody hierarchy labeling provided by the embodiment of the present application, obtaining the prosody hierarchy structure through the prosody hierarchy labeling model may include:
determining at least one of a prosodic word, a prosodic phrase and an intonational phrase through the prosody hierarchy labeling model;
or, determining a prosodic word and/or a prosodic phrase through the prosody hierarchy labeling model.
In the present embodiment, two common prosody hierarchy structures are introduced. In the first case, at least one of a prosodic word, a prosodic phrase and an intonational phrase is determined through the prosody hierarchy labeling model; that is, the model is trained on four classes: no prosody hierarchy boundary, prosodic word boundary, prosodic phrase boundary and intonational phrase boundary. In the second case, a prosodic word and/or a prosodic phrase is determined through the prosody hierarchy labeling model; that is, the model is trained on three classes: no prosody hierarchy boundary, prosodic word boundary and prosodic phrase boundary.
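The two schemes correspond to a four-class and a three-class boundary tag set. A minimal sketch (tag names, and the collapsing rule, are illustrative assumptions, not specified by the text):

```python
# The four-class scheme (with intonational phrase boundaries) and the
# three-class scheme (without). Tag names are illustrative.
FOUR_CLASS = {0: "no_boundary", 1: "PW_boundary",
              2: "PPH_boundary", 3: "IPH_boundary"}
THREE_CLASS = {0: "no_boundary", 1: "PW_boundary", 2: "PPH_boundary"}

def to_three_class(label):
    """Collapse an IPH boundary onto a PPH boundary when only the
    two-level labeling scheme is wanted. This is one reasonable mapping
    (an assumption), since every IPH boundary is also a phrase break."""
    return min(label, 2)

assert to_three_class(3) == 2   # IPH boundary becomes a PPH boundary
assert to_three_class(1) == 1   # PW boundary is unchanged
```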
When labeling prosody hierarchy, this application uses the prosody hierarchy labeling model generated in the training stage to label the input text data after text processing, thereby obtaining text labeled with prosody hierarchy structure and quickly building the corpus needed by a speech synthesis system.
Secondly, the embodiment of the present application describes two common prosody hierarchy labeling methods: one determines prosodic words, prosodic phrases and intonational phrases through the prosody hierarchy labeling model, and the other determines prosodic words and prosodic phrases. In this way, the user can choose the finer labeling scheme with the three-level prosody hierarchy structure of prosodic word, prosodic phrase and intonational phrase, or the two-level scheme with prosodic word and prosodic phrase. The scheme can therefore select the output prosody hierarchy on demand, improving its flexibility.
With reference to the above introduction, the method of model training in this application is introduced below. Referring to Fig. 6, one embodiment of the method of model training in the embodiment of the present application includes:
201. Obtain text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a term identification;
In the present embodiment, the text data to be trained and the corresponding audio data to be trained are obtained first. The text data to be trained can specifically be a sentence or a paragraph, and its language form includes but is not limited to Chinese, Japanese, English or Korean. The audio data to be trained can specifically be an audio file. The text data to be trained includes at least one word and can therefore be segmented; for example, "showing best regards and hearty blessing" can be divided into the following five words: "showing", "sincere", "greeting", "and hearty", "blessing", and different words correspond to different term identifications.
It can be understood that training generally requires a large number of samples; here, the text data to be trained and the audio data to be trained are samples. For ease of introduction, the text data to be trained and the audio data to be trained are described as one sample, which should not be construed as a limitation of this application.
202. Extract the text feature set to be trained of each word according to the text data to be trained, where the text feature set to be trained includes the part of speech, the word length and the punctuation type after the word;
In the present embodiment, feature extraction is then performed on each word. Feature extraction covers two aspects: the first is the extraction of text features and the second is the extraction of acoustic features. When extracting text features, the text features of each word in the text data to be trained need to be extracted. Taking the text data to be trained "showing best regards and hearty blessing" as an example, the text feature set to be trained corresponding to each word can be extracted; it includes but is not limited to the part of speech, the word length and the punctuation type after the word.
It should be noted that the part of speech, the word length and the punctuation type after the word are introduced in the above embodiment and are not repeated here.
203. Extract the acoustic feature set to be trained of each word according to the audio data to be trained, where the acoustic feature set to be trained includes the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable and the inter-word acoustic feature change values;
In the present embodiment, when extracting acoustic features, the acoustic features of each word in the audio data to be trained need to be extracted. Taking the text data to be trained "showing best regards and hearty blessing" as an example, the acoustic feature set to be trained of each word can be extracted; it includes but is not limited to the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable and the inter-word acoustic feature change values.
It should be noted that the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable and the inter-word acoustic feature change values are introduced in the above embodiment and are not repeated here.
204. Train on the term identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain the prosody hierarchy labeling model, where the prosody hierarchy labeling model is used to label prosody hierarchy structure.
In the present embodiment, the training process of the prosody hierarchy labeling model is introduced. The training data is text data labeled with prosody hierarchy structure and the corresponding audio data, and the sequence is modeled with a deep neural network. Each sentence contains multiple words, and a sentence is a word sequence; the features and label of each word serve as the input and output of one time step of the deep neural network. Each word has a corresponding label y, so the labels of a sentence can be expressed as a vector Y. From the text data and the corresponding audio data, the term identification, text features and acoustic features of each word in the sentence can be extracted to form the feature x of that word, and the multiple words of the sentence can be expressed as the input vector X. The loss function is expressed as L(Y, f(X)); training on a large number of samples to make the loss function as small as possible yields the trained parameters of the neural network, and thereby the model for automatic labeling of prosody hierarchy structure, namely the prosody hierarchy labeling model.
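The per-sentence setup above (one word per time step, feature vector X, label vector Y, loss L(Y, f(X))) can be sketched with a cross-entropy loss. The model f is stubbed here as a uniform classifier; in the text it is the deep neural network over word id, text and acoustic features:

```python
import math

def cross_entropy(Y, probs):
    """Sentence-level loss L(Y, f(X)): sum of per-word negative
    log-likelihoods, one time step per word."""
    return -sum(math.log(p[y]) for y, p in zip(Y, probs))

def f(X, n_classes=4):
    """Stub model: uniform class probabilities for every word. Training
    would adjust parameters so that cross_entropy(Y, f(X)) shrinks."""
    return [[1.0 / n_classes] * n_classes for _ in X]

X = [{"word_id": 0}, {"word_id": 1}, {"word_id": 2}]  # three-word sentence
Y = [0, 1, 2]                                          # one boundary label per word
loss = cross_entropy(Y, f(X))
print(loss)  # for the uniform stub this is 3 * ln(4)
```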
In the embodiment of the present application, a method of model training is provided. First, text data to be trained and audio data to be trained are obtained, where there is a correspondence between them and each word corresponds to a term identification. The text feature set to be trained of each word is then extracted according to the text data to be trained, where it includes the part of speech, the word length and the punctuation type after the word, and the acoustic feature set to be trained of each word is extracted according to the audio data to be trained, where it includes the final syllable duration, the pause duration after the word, the acoustic statistics of the final syllable and the inter-word acoustic feature change values. Finally, the term identification, the text feature set to be trained and the acoustic feature set to be trained of each word are trained on to obtain the prosody hierarchy labeling model. In this way, the prosody hierarchy labeling model is established jointly from text features and acoustic features, which provides richer features for prosody hierarchy labeling; a more accurate prosody hierarchy labeling model improves the accuracy of prosody hierarchy labeling and the effect of speech synthesis.
Optionally, on the basis of the embodiment corresponding to Fig. 6 above, in a first alternative embodiment of the method of model training provided by the embodiment of the present application, after obtaining the text data to be trained and the audio data to be trained, the method may also include:
performing word segmentation on the text data to be trained to obtain at least one word;
obtaining the target term identification corresponding to the target word according to a preset term identification relationship, where the preset term identification relationship is used to indicate the relationship between each preset word and its term identification, and the target word is any one of the at least one word;
generating the target term vector corresponding to the target word in the text data to be trained;
and training on the term identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain the prosody hierarchy labeling model may include:
training on the target term identification and the target term vector to obtain a first model parameter, where the first model parameter is used to generate the word embedding layer in the prosody hierarchy labeling model.
In the present embodiment, a method of training the word embedding layer in the prosody hierarchy labeling model is provided. First, the text data to be trained needs to be obtained and then segmented; for example, if the text data to be trained is "setting up cooperatives forms a new model through e-commerce platforms", the segmentation yields "set up", "cooperative society", "by", "e-commerce", "platform", "form", "new model". Next, the term identification corresponding to each word needs to be determined according to the preset term identification relationship. For ease of understanding, please refer to Table 1, which is an illustration of the preset term identification relationship.
Table 1
Term identification | Word |
0 | set up |
1 | cooperative society |
2 | by |
3 | e-commerce |
4 | platform |
5 | form |
6 | new model |
As can be seen from Table 1, the preset term identification relationship is used to indicate the relationship between words and term identifications, and identical words correspond to the same term identification. Assuming the target word is "set up", the term identification of the word is "0", which then serves as the input to the word embedding layer.
Following the above method of generating the target term identification and target term vector, the other term identifications and term vectors are generated. According to the mapping relationship between term identification and term vector, these term identifications and term vectors are trained, and the first model parameter is obtained by minimizing the loss function; the first model parameter is used to generate the word embedding layer in the prosody hierarchy labeling model. In practical applications, the word embedding layer can be updated periodically to improve its accuracy.
Secondly, the embodiment of the present application describes a method of training the word embedding layer: word segmentation is first performed on the text data to be trained, the target term identification corresponding to the target word is obtained according to the preset term identification relationship, the target term vector corresponding to the target word in the text data to be trained is generated, and the target term identification and target term vector are then trained to obtain the first model parameter, where the first model parameter is used to generate the word embedding layer in the prosody hierarchy labeling model. In this way, the word embedding layer in the prosody hierarchy labeling model can be trained directly, and the other neural networks in the prosody hierarchy labeling model can be trained at the same time as the word embedding layer, which saves the process of training a separate term vector model with an additional independent neural network and improves training efficiency.
Optionally, on the basis of the embodiment corresponding to Fig. 6 above, in a second alternative embodiment of the method of model training provided by the embodiment of the present application, extracting the text feature set to be trained of each word according to the text data to be trained may include:
obtaining the part of speech, the word length and the punctuation type after the word of the target word in the text data to be trained, where the part of speech indicates the syntactic category of the word, the word length indicates the number of characters in the word, and the punctuation type after the word is used to indicate the type of punctuation that follows the word;
obtaining the part of speech, the word length and the punctuation type after the word of the related words in the text data to be trained, where a related word is a word that has an association relationship with the target word;
and training on the term identification corresponding to each word, the text feature set to be trained of each word and the acoustic feature set to be trained of each word to obtain the prosody hierarchy labeling model may include:
training on the part of speech, the word length and the punctuation type after the word of the target word and of the related words to obtain a second model parameter, where the second model parameter is used to generate the text neural network in the prosody hierarchy labeling model.
In the present embodiment, a method of training the text neural network of the prosody hierarchy labeling model is provided. For ease of understanding, the target word in the text data to be trained continues to serve as the example; it can be understood that the other words in the text data to be trained are processed similarly to the target word, which is not repeated here.
Specifically, word segmentation is first performed on the text data to be trained; for example, if the text data to be trained is "setting up cooperatives, through e-commerce platforms, forms a new model", the segmentation yields "set up", "cooperative society", "by", "e-commerce", "platform", "form", "new model". Assuming the target word is "cooperative society", the part of speech of the target word is noun, the word length is 3, and the punctuation type after the word is comma. For ease of understanding, the relationship between part of speech and identifier and the relationship between post-word punctuation type and identifier are illustrated below with Tables 2 and 3. In practical applications, text features are usually represented numerically, so the textual concepts need to be converted into numerical concepts.
Table 2
| POS label | Part of speech | Examples |
| --- | --- | --- |
| 0 | Noun | Shanghai, cucumber, Chinese cabbage, tractor, quality, virtue |
| 1 | Verb | come, walk, run, pay attention, learn, take off, agree, recognize |
| 2 | Adjective | many, thin, tall, ugly, snow-white, beautiful, red |
| 3 | Adverb | very, quite, extremely, just, all, at once, immediately, once |
| 4 | Pronoun | I, you, he, she, it, we |
| 5 | Preposition | from, to, toward, for, in order to, at, than, by |
| 6 | Quantifier | Chinese measure words such as 个, 只, 支, 种 |
| 7 | Conjunction | so, therefore, and, or |
| 8 | Auxiliary word | Chinese particles such as 的, 了, 所, 呢 |
| 9 | Numeral | one, two, three, seven, ten, hundred, thousand, ten thousand, hundred million |
| 10 | Interjection | hey, ah, hi, oh |
| 11 | Onomatopoeia | toot, rumble, chuckle, rustle |
Table 3
| Post-word punctuation label | Post-word punctuation type | Example |
| --- | --- | --- |
| 0 | Full stop | 。 |
| 1 | Question mark | ？ |
| 2 | Exclamation mark | ！ |
| 3 | Pause mark | 、 |
| 4 | Comma | ， |
| 5 | Semicolon | ； |
| 6 | Colon | ： |
| 7 | No punctuation | |
As can be seen from Table 2 and Table 3, when the target word is "cooperative", the corresponding features are "noun, 3, comma", which can be represented as "0 3 4". To enrich the text features, the words surrounding the target word also need to be considered, i.e., related words are obtained. A related word may be the previous word, the next word, the previous two words, and so on, which is not limited here.
Assuming the related words are the previous word and the next word of the target word, and the target word is "cooperative", the related words are "establish" and "through". From Table 2 and Table 3, the features of "establish" are "verb, 2, no punctuation".
The number of part-of-speech categories, the maximum word length, and the number of punctuation categories in the corpus are counted, so the part-of-speech feature, word-length feature, and post-word punctuation feature can each be represented as a one-hot vector. Concatenating the three one-hot vectors yields the text feature of the current target word, and concatenating the text feature of the target word with the text features of the related words yields the text feature vector of the target word, i.e., the to-be-trained text feature set.
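The one-hot encoding and concatenation described above can be sketched as follows. This is an illustrative sketch only: the 12 part-of-speech labels and 8 punctuation labels follow Tables 2 and 3, while the maximum word length of 10 and the function names are hypothetical.

```python
import numpy as np

def one_hot(index: int, size: int) -> np.ndarray:
    """Return a one-hot row vector of the given size."""
    v = np.zeros(size)
    v[index] = 1.0
    return v

def word_text_feature(pos_id, word_len, punct_id,
                      n_pos=12, max_len=10, n_punct=8):
    """Concatenate one-hot encodings of POS, word length and
    post-word punctuation into one text-feature vector."""
    return np.concatenate([
        one_hot(pos_id, n_pos),
        one_hot(word_len - 1, max_len),  # lengths 1..max_len
        one_hot(punct_id, n_punct),
    ])

# "cooperative": noun (0), length 3, comma (4) -> code "0 3 4"
target = word_text_feature(0, 3, 4)
# related word "establish": verb (1), length 2, no punctuation (7)
prev = word_text_feature(1, 2, 7)
# splice the related word's features with the target word's features
feature_vector = np.concatenate([prev, target])
```

In practice the vector would also include the features of the following word, in the same concatenated layout.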
Following the above method of extracting a to-be-trained text feature set, the to-be-trained text feature set of each word is extracted, and these text feature sets are trained; the second model parameter is obtained by minimizing a loss function, and the second model parameter is used to generate the text neural network in the prosody hierarchy annotation model. In practical applications, the text neural network can be updated periodically to improve its accuracy.
It is understood that the text neural network may be a feedforward neural network or a convolutional neural network, or a neural network of another type; the bidirectional long short-term memory network may likewise be replaced by one of its variants, such as a recurrent neural network with gated recurrent units. These are merely illustrations and should not be construed as limiting the application. The application likewise does not limit the number of layers or the number of neurons of the text neural network.
Secondly, this embodiment of the present application describes a method for training the text neural network: first obtain the part of speech, word length, and post-word punctuation type of the target word in the to-be-trained text data, and obtain the part of speech, word length, and post-word punctuation type of the related word; then train on these features of the target word and the related word to obtain the second model parameter, which is used to generate the text neural network in the prosody hierarchy annotation model. In this way, the system can, through the neural network, automatically learn high-level feature representations beneficial to prosody hierarchy structure annotation from the originally input text feature set, thereby improving the performance of automatic prosody hierarchy structure annotation.
Optionally, on the basis of the embodiment corresponding to Fig. 6 above, in a third alternative embodiment of the model training method provided by the embodiment of the present application, after obtaining the to-be-trained text data and the to-be-trained audio data, the method may further include:
performing forced alignment on the to-be-trained text data and the to-be-trained audio data to obtain a time-aligned text.
Extracting the to-be-trained acoustic feature set of each word according to the to-be-trained audio data may include:
determining the final-syllable duration of the target word according to the time-aligned text.
In this embodiment, how to extract the acoustic feature set of a word is introduced: forced alignment is performed on the to-be-trained text data and the to-be-trained audio data to obtain a time-aligned text. Specifically, frame boundaries at the phoneme level are obtained, from which the frame boundaries of the final of the word's last syllable follow; from the starting frame number and ending frame number of that final, the final-syllable duration of the target word is calculated.
For ease of introduction, please refer to Fig. 7, which is a schematic flowchart of extracting an acoustic feature set in an embodiment of the present application. As shown in the figure, in step A1, text data and audio data are first obtained, which may specifically be the to-be-trained text data and the to-be-trained audio data. In step A2, word segmentation is performed on the to-be-trained text data, and a forced-alignment tool is applied to the text data and the audio data to obtain the time-aligned text, i.e., the phoneme-level boundary information. In step A4, the start and end frame numbers corresponding to the boundaries of the final of each word's last syllable are determined; likewise, the frame numbers of the last voiced frame at the end of a word and the first voiced frame at the start of the next word can be determined. In step A3, the log fundamental frequency curve and the log energy curve are extracted frame by frame from the audio data. In step A5, combined with the time-aligned text, the log fundamental frequency curve and log energy curve of the final syllable are obtained, as well as the log fundamental frequency values and log energy values of the word-final voiced frame and the first voiced frame of the next word. In step A6, the log fundamental frequency statistical features and log energy statistical features of the final syllable are calculated, together with the log fundamental frequency difference and log energy difference between the word-final voiced frame and the first voiced frame of the next word. In step A7, these acoustic features are concatenated to form the word-level acoustic feature set used for the automatic prosody hierarchy structure annotation task.
Specifically, after forced alignment of the to-be-trained text data and the to-be-trained audio data, the phoneme-level boundary information is obtained. Suppose the to-be-trained text data is "extending sincere greetings and heartfelt blessings" and the target word is "greetings". The final-syllable duration is the time span of the last syllable of the word; with the forced-alignment information, the frame boundaries of the final of the last syllable can be calculated. For example, the last syllable of the target word "greetings" is pronounced "hou", whose final is "ou". If the starting frame number of "ou" in the audio is 101 and the ending frame number is 120, then "ou" lasts 20 frames; at 5 milliseconds per frame, the duration of "ou" is 100 milliseconds, i.e., the final-syllable duration of "greetings" is 100 milliseconds.
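The duration calculation in this example can be sketched as follows; the function name and the default 5 ms frame shift are illustrative, and the frame range is treated as inclusive so that frames 101 through 120 count as 20 frames.

```python
def frames_to_ms(start_frame: int, end_frame: int, frame_ms: float = 5.0) -> float:
    """Duration in milliseconds of an inclusive range of frames."""
    return (end_frame - start_frame + 1) * frame_ms

# The final "ou" of the word's last syllable spans frames 101..120.
duration = frames_to_ms(101, 120)  # 20 frames * 5 ms = 100.0 ms
```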
Secondly, in this embodiment of the present application, after the to-be-trained text data and the to-be-trained audio data are obtained, forced alignment is performed on them to obtain the time-aligned text, and the final-syllable duration of the target word is then determined according to the time-aligned text. In this way, the time-aligned text can be obtained and the final-syllable duration extracted as one element of the acoustic feature set; high-level features beneficial to annotation are learned automatically from the originally input acoustic feature set, thereby improving the accuracy of the prosody hierarchy annotation model.
Optionally, on the basis of the third embodiment corresponding to Fig. 6 above, in a fourth alternative embodiment of the model training method provided by the embodiment of the present application, extracting the to-be-trained acoustic feature set of each word according to the to-be-trained audio data may include:
determining the post-word pause duration of the target word according to the time-aligned text.
In this embodiment, how to obtain the post-word pause duration of a word is introduced. Specifically, after forced alignment of the to-be-trained text data and the to-be-trained audio data, M speech frames are obtained. Suppose the to-be-trained text data is "extending sincere greetings and heartfelt blessings" and the target word is "greetings"; the next adjacent word is "and". The short pause between "greetings" and "and" can then be calculated from the time-aligned text data. If the pause lasts 20 frames at 5 milliseconds per frame, the post-word pause duration of the target word is 100 milliseconds.
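Under the same frame-based convention, the post-word pause might be computed as below. The assumption that the pause consists of the frames strictly between the word's last frame and the next word's first frame is a convention of this sketch, not stated in the source; the function name is also hypothetical.

```python
def pause_ms(word_end_frame: int, next_word_start_frame: int,
             frame_ms: float = 5.0) -> float:
    """Pause duration between a word and the next word, counting the
    frames strictly between the word's last frame and the next word's
    first frame (an assumed convention)."""
    pause_frames = next_word_start_frame - word_end_frame - 1
    return pause_frames * frame_ms

# "greetings" ends at frame 120, "and" starts at frame 141:
# frames 121..140 form a 20-frame pause -> 100 ms at 5 ms per frame.
pause = pause_ms(120, 141)
```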
Again, in this embodiment of the present application, after the to-be-trained text data and the to-be-trained audio data are obtained, forced alignment is performed on them to obtain the time-aligned text, and the post-word pause duration can then be determined according to the time-aligned text. In this way, the post-word pause duration of each word can be determined after aligning the text data and audio data, and used as one element of the acoustic feature set; high-level features beneficial to annotation are learned automatically from the originally input acoustic feature set, thereby improving the accuracy of the prosody hierarchy annotation model.
Optionally, on the basis of the third embodiment corresponding to Fig. 6 above, in a fifth alternative embodiment of the model training method provided by the embodiment of the present application, extracting the to-be-trained acoustic feature set of each word according to the to-be-trained audio data may include:
calculating the frame number of the voiced start frame and the frame number of the voiced end frame of the final syllable of the target word according to the time-aligned text and the fundamental frequency information extracted from the to-be-trained audio data;
extracting the log fundamental frequency curve and the log energy curve of the to-be-trained audio data;
calculating the final-syllable acoustic statistical features of the target word according to the frame number of the voiced start frame, the frame number of the voiced end frame, the log fundamental frequency curve, and the log energy curve, where the final-syllable acoustic statistical features include at least one of the maximum, minimum, range, mean, and variance of the log fundamental frequency curve, and further include at least one of the maximum, minimum, range, mean, and variance of the log energy curve.
In this embodiment, how to obtain the final-syllable acoustic statistical features of a word is introduced. Specifically, after forced alignment of the to-be-trained text data and the to-be-trained audio data, the time-aligned text is obtained. Suppose the to-be-trained text data is "extending sincere greetings and heartfelt blessings". The fundamental frequency and energy of the corresponding audio are extracted frame by frame, generating a fundamental frequency curve and an energy curve. For ease of understanding, please refer to Fig. 8 and Fig. 9: Fig. 8 is a schematic diagram of an embodiment of the fundamental frequency curve in an embodiment of the present application, and Fig. 9 is a schematic diagram of an embodiment of the energy curve. To normalize the data, the logarithm of each curve is taken, yielding the log fundamental frequency curve and the log energy curve; near a prosody hierarchy structure boundary, both fundamental frequency and energy weaken. Assuming the target word is "greetings", the segments of the log fundamental frequency curve and log energy curve corresponding to the word's final syllable are intercepted from those of the audio according to the frame numbers of the voiced start frame and voiced end frame of "greetings". From these intercepted segments, the final-syllable acoustic statistical features of the following ten dimensions are calculated: the maximum, minimum, range, mean, and variance of the log fundamental frequency curve, and the maximum, minimum, range, mean, and variance of the log energy curve.
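The ten-dimensional statistics can be sketched with NumPy as follows; the function names are hypothetical, and the voiced segment is taken as an inclusive frame range.

```python
import numpy as np

def curve_stats(curve: np.ndarray) -> list:
    """Max, min, range, mean and variance of one log-domain curve."""
    return [curve.max(), curve.min(), curve.max() - curve.min(),
            curve.mean(), curve.var()]

def final_syllable_features(log_f0: np.ndarray, log_energy: np.ndarray,
                            start: int, end: int) -> list:
    """Ten-dimensional statistics over the voiced segment [start, end]
    of the final syllable: five for log F0, five for log energy."""
    seg_f0 = log_f0[start:end + 1]
    seg_en = log_energy[start:end + 1]
    return curve_stats(seg_f0) + curve_stats(seg_en)
```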
Again, in this embodiment of the present application, after the to-be-trained text data and the to-be-trained audio data are obtained, forced alignment is performed on them to obtain the time-aligned text. According to the time-aligned text and the fundamental frequency information extracted from the to-be-trained audio data, the frame numbers of the voiced start frame and voiced end frame of the final syllable of the target word are calculated; the log fundamental frequency curve and log energy curve of the to-be-trained audio data are extracted; and finally the final-syllable acoustic statistical features of the target word are calculated from these frame numbers and curves. In this way, the time-aligned text data is obtained, the start and end frame numbers of the word-final voiced segment are then obtained from the fundamental frequency information extracted from the audio, and high-level features beneficial to annotation are learned automatically from the originally input acoustic feature set, thereby improving the accuracy of the prosody hierarchy annotation model.
Optionally, on the basis of the third embodiment corresponding to Fig. 6 above, in a seventh alternative embodiment of the model training method provided by the embodiment of the present application, extracting the to-be-trained acoustic feature set of each word according to the to-be-trained audio data may include:
calculating the frame number of the last voiced frame of the target word and the frame number of the first voiced frame of the next adjacent word according to the time-aligned text and the fundamental frequency information extracted from the to-be-trained audio data;
determining the fundamental frequency values and energy values of the word-final voiced frame of the target word and of the first voiced frame of the next adjacent word, according to these two frame numbers and the fundamental frequency and energy information extracted frame by frame from the to-be-trained audio data;
calculating the log difference of the fundamental frequency values and the log difference of the energy values between the word-final voiced frame of the target word and the first voiced frame of the next adjacent word, where the log difference of fundamental frequency values and the log difference of energy values belong to the inter-word acoustic feature change values.
In this embodiment, how to obtain the inter-word acoustic feature change values is introduced. Specifically, after forced alignment of the to-be-trained text data and the to-be-trained audio data, the time-aligned text is obtained. Suppose the to-be-trained text data is "extending sincere greetings and heartfelt blessings". The fundamental frequency and energy information of the corresponding to-be-trained audio data is extracted frame by frame, generating a fundamental frequency curve and an energy curve; to normalize the data, the logarithm of each curve is taken, yielding the log fundamental frequency curve and the log energy curve. Assuming the target word is "greetings", then according to the time-aligned text and the frame-by-frame fundamental frequency information, the last voiced frame of "greetings" and the first voiced frame of the next word "and" can be determined. The fundamental frequency values and energy values of these two frames are then obtained, and the log fundamental frequency difference and log energy difference between the two frames are calculated.
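A minimal sketch of the inter-word log differences, assuming the raw fundamental frequency and energy values of the two boundary frames are already available (the function name is hypothetical):

```python
import math

def boundary_log_diffs(f0_prev: float, energy_prev: float,
                       f0_next: float, energy_next: float):
    """Log F0 and log energy differences across a word boundary:
    last voiced frame of the word vs. first voiced frame of the
    next word. Larger drops suggest a stronger prosodic boundary."""
    d_f0 = math.log(f0_next) - math.log(f0_prev)
    d_energy = math.log(energy_next) - math.log(energy_prev)
    return d_f0, d_energy

# F0 halves across the boundary, energy stays flat.
d_f0, d_energy = boundary_log_diffs(200.0, 4.0, 100.0, 4.0)
```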
Again, in this embodiment of the present application, after the to-be-trained text data and the to-be-trained audio data are obtained, forced alignment is performed on them to obtain the time-aligned text. According to the frame numbers of the last voiced frame of the target word and the first voiced frame of the next adjacent word, together with the fundamental frequency and energy data extracted frame by frame from the audio, the fundamental frequency values and energy values of the word-final voiced frame of the target word and of the first voiced frame of the next word are determined; the log difference of the two fundamental frequency values and the log difference of the two energy values are then calculated and used as the inter-word acoustic feature change values. In this way, high-level features beneficial to annotation can be learned automatically from the originally input acoustic feature set, thereby improving the accuracy of the prosody hierarchy annotation model.
Optionally, on the basis of Fig. 6 above and any one of the first to seventh embodiments corresponding to Fig. 6, in an eighth alternative embodiment of the model training method provided by the embodiment of the present application, training the term identifier of each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain the prosody hierarchy annotation model may include:
obtaining a first output result of the target term identifier through the word embedding layer in the prosody hierarchy annotation model, where the target term identifier corresponds to the target word, the target word is any one of the at least one word, and the word embedding layer is obtained by training according to the first model parameter;
obtaining a second output result of the target to-be-trained text feature set through the text neural network in the prosody hierarchy annotation model, where the target to-be-trained text feature set corresponds to the target word, and the text neural network is obtained by training according to the second model parameter;
training on the first output result, the second output result, and the target to-be-trained acoustic feature set to obtain a third model parameter, where the target to-be-trained acoustic feature set corresponds to the target word, and the third model parameter is used to generate the acoustic neural network in the prosody hierarchy annotation model;
generating the prosody hierarchy annotation model according to the first model parameter, the second model parameter, and the third model parameter.
In this embodiment, the method of training to obtain the prosody hierarchy annotation model is introduced. For ease of understanding, please refer to Fig. 10, which is a schematic structural diagram of the prosody hierarchy annotation model in an embodiment of the present application. As shown, taking the target word as an example: the term identifier is the target term identifier, the text feature set is the target to-be-trained text feature set corresponding to the target word, and the acoustic feature set is the target to-be-trained acoustic feature set corresponding to the target word. The target term identifier serves as the input of the word embedding layer, which outputs the first output result; the first output result is specifically the word vector to which the target term identifier is mapped, which may be 200-dimensional. The target to-be-trained text feature set (part of speech, word length, and post-word punctuation type) serves as the input of the text neural network (e.g., a feedforward neural network), which outputs the second output result. The target to-be-trained acoustic feature set, the first output result, and the second output result together serve as the input of the acoustic neural network (e.g., a bidirectional long short-term memory network), and a softmax layer outputs the posterior probability of each prosody hierarchy structure type for the target word. For example, if the probability of a non-prosody-hierarchy boundary is 0.1, the probability of a prosodic word is 0.1, the probability of a prosodic phrase is 0.2, and the probability of an intonation phrase is 0.6, then the prosody hierarchy structure with the maximum posterior probability is taken as the annotation result, so the annotation result of the target word is intonation phrase. This annotation result is the prediction obtained during training and needs to be compared with the ground truth, i.e., a loss function is minimized to determine the third model parameter of the acoustic neural network. Combining the first model parameter, the second model parameter, and the third model parameter, training yields the prosody hierarchy annotation model. The prosody hierarchy annotation model adopts a stacked structure of a feedforward neural network and a bidirectional long short-term memory network, and can simultaneously annotate three kinds of prosody hierarchy structure: prosodic word, prosodic phrase, and intonation phrase.
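The final softmax-and-argmax step can be illustrated as follows. The logits are contrived so that the posteriors reproduce the 0.1/0.1/0.2/0.6 example above; the label list and function names are illustrative, not from the patent.

```python
import numpy as np

LABELS = ["non-boundary", "prosodic word", "prosodic phrase", "intonation phrase"]

def softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over class logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

# Contrived logits whose posteriors match the worked example.
logits = np.log(np.array([0.1, 0.1, 0.2, 0.6]))
posterior = softmax(logits)
# Take the class with the maximum posterior probability.
label = LABELS[int(posterior.argmax())]  # "intonation phrase"
```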
The loss function estimates the degree of inconsistency between the predicted value and the true value of the model; it is a non-negative real-valued function. The loss function used by the application may be the cross entropy, or a weighted cross entropy.
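A weighted cross entropy for a single word might look like the sketch below; the class weights shown are hypothetical, since the patent does not specify weight values.

```python
import math

def weighted_cross_entropy(posterior, true_class, class_weights):
    """Weighted cross-entropy loss for one word: the negative log
    posterior of the true class, scaled by that class's weight."""
    return -class_weights[true_class] * math.log(posterior[true_class])

# Posterior from the worked example; true label: intonation phrase (3).
# Doubling the weight of the intonation-phrase class is hypothetical.
loss = weighted_cross_entropy([0.1, 0.1, 0.2, 0.6], 3, [1.0, 1.0, 1.0, 2.0])
```

With all class weights equal to 1.0 this reduces to the ordinary cross entropy.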
It is understood that the word embedding layer, the feedforward neural network, and the bidirectional long short-term memory network are trained together. The word embedding layer is used to train word vectors, and the feedforward neural network automatically extracts, from the originally input features (part of speech, word length, and post-word punctuation type), high-level feature representations more favorable to the annotation task. At the input of the bidirectional long short-term memory network, these features are concatenated, so that text features and acoustic features are exploited jointly.
The bidirectional long short-term memory network can learn dependencies between contexts, which the annotation task also requires: for example, if the previous word is an intonation phrase boundary, the current word is unlikely to be an intonation phrase boundary as well. Therefore, jointly using a trainable word embedding layer with a stacked structure of a feedforward neural network and a bidirectional long short-term memory network not only exploits text and acoustic feature information as well as contextual features, but can also automatically extract high-level features from the text features, making it well suited to the prosody hierarchy structure annotation task.
Further, this embodiment of the present application describes a method of training to obtain the prosody hierarchy annotation model: three classes of model parameters need to be trained, namely the first model parameter, the second model parameter, and the third model parameter, and the three are trained as a whole, simultaneously generating the prosody hierarchy annotation model. In this way, the neural networks of the three parts are stacked to form a complete prosody hierarchy annotation model, and model training is performed on it as a whole. The trained content includes the relationship between term identifiers and word vectors, between words and text features, and between audio and acoustic features, so that richer features can be obtained, helping to improve the accuracy of sentence annotation.
The prosody hierarchy annotation apparatus in the application is described in detail below. Please refer to Figure 11, which is a schematic diagram of an embodiment of the prosody hierarchy annotation apparatus in an embodiment of the present application. The prosody hierarchy annotation apparatus 30 includes:
an acquisition module 301, configured to obtain to-be-annotated text data and audio data, where there is a correspondence between the to-be-annotated text data and the audio data, the to-be-annotated text data includes at least one word, and each word corresponds to a term identifier;
an extraction module 302, configured to extract the to-be-annotated text feature set of each word according to the to-be-annotated text data obtained by the acquisition module 301, where the to-be-annotated text feature set includes the part of speech, the word length, and the post-word punctuation type;
the extraction module 302 being further configured to extract the acoustic feature set of each word according to the audio data obtained by the acquisition module 301, where the acoustic feature set includes the final-syllable duration, the post-word pause duration, the final-syllable acoustic statistical features, and the inter-word acoustic feature change values;
a prediction module 303, configured to obtain the prosody hierarchy structure through the prosody hierarchy annotation model according to the term identifier of each word, the to-be-annotated text feature set of each word extracted by the extraction module 302, and the acoustic feature set of each word.
In this embodiment, the acquisition module 301 obtains the to-be-annotated text data and audio data, where there is a correspondence between the to-be-annotated text data and the audio data, the to-be-annotated text data includes at least one word, and each word corresponds to a term identifier; the extraction module 302 extracts the to-be-annotated text feature set of each word according to the to-be-annotated text data obtained by the acquisition module 301, where the to-be-annotated text feature set includes the part of speech, the word length, and the post-word punctuation type; the extraction module 302 further extracts the acoustic feature set of each word according to the audio data obtained by the acquisition module 301, where the acoustic feature set includes the final-syllable duration, the post-word pause duration, the final-syllable acoustic statistical features, and the inter-word acoustic feature change values; and the prediction module 303 obtains the prosody hierarchy structure through the prosody hierarchy annotation model according to the term identifier of each word, the to-be-annotated text feature set of each word extracted by the extraction module 302, and the acoustic feature set of each word.
In the embodiment of the present application, a prosody hierarchy annotation apparatus is provided. First, to-be-annotated text data and audio data are obtained, where there is a correspondence between the to-be-annotated text data and the audio data, the to-be-annotated text data includes at least one word, and each word corresponds to a term identifier. Then the to-be-annotated text feature set of each word is extracted according to the to-be-annotated text data, where the to-be-annotated text feature set includes the part of speech, word length, and post-word punctuation type; the acoustic feature set of each word is extracted according to the audio data, where the acoustic feature set includes the final-syllable duration, post-word pause duration, final-syllable acoustic statistical features, and inter-word acoustic feature change values. Finally, the prosody hierarchy structure is obtained through the prosody hierarchy annotation model according to the term identifier, the to-be-annotated text feature set, and the acoustic feature set of each word. In this way, the prosody hierarchy annotation model is established by combining text features and acoustic features, which can provide richer features for prosody hierarchy annotation, and using a more accurate prosody hierarchy annotation model can improve the accuracy of prosody hierarchy annotation.
Optionally, on the basis of the embodiment corresponding to Figure 11 above, in another embodiment of the prosody hierarchy annotation apparatus 30 provided by the embodiment of the present application:
the prediction module 303 is specifically configured to determine at least one of prosodic word, prosodic phrase, and intonation phrase through the prosody hierarchy annotation model;
or,
to determine prosodic words and/or prosodic phrases through the prosody hierarchy annotation model.
Secondly, the embodiment of the present application describes two common prosody hierarchy annotation methods: one determines prosodic words, prosodic phrases, and intonation phrases through the prosody hierarchy annotation model, and the other determines prosodic words and prosodic phrases through the prosody hierarchy annotation model. In this way, the user can choose the finer three-layer annotation scheme of prosodic word, prosodic phrase, and intonation phrase, or the two-layer scheme of prosodic word and prosodic phrase. This scheme can therefore select the output prosody hierarchy on demand, improving the flexibility of the scheme.
The model training apparatus in the present application is described in detail below. Please refer to Figure 12, which is a schematic diagram of one embodiment of the model training apparatus in the embodiments of the present application. The model training apparatus 40 includes:
an obtaining module 401, configured to obtain text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
an extraction module 402, configured to extract the text feature set to be trained of each word according to the text data to be trained obtained by the obtaining module 401, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type;
the extraction module 402 is further configured to extract the acoustic feature set to be trained of each word according to the audio data to be trained obtained by the obtaining module 401, where the acoustic feature set to be trained includes final syllable duration, post-word pause duration, final syllable acoustic statistical features, and inter-word acoustic feature change values;
a training module 403, configured to train on the word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module 402, and the acoustic feature set to be trained of each word, to obtain a prosody hierarchy marking model, where the prosody hierarchy marking model is used to mark a prosody hierarchy structure.
In this embodiment, the obtaining module 401 obtains the text data to be trained and the audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier. The extraction module 402 extracts the text feature set to be trained of each word from the text data to be trained obtained by the obtaining module 401, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type, and extracts the acoustic feature set to be trained of each word from the audio data to be trained obtained by the obtaining module 401, where the acoustic feature set to be trained includes final syllable duration, post-word pause duration, final syllable acoustic statistical features, and inter-word acoustic feature change values. The training module 403 trains on the word identifier corresponding to each word, the text feature set to be trained of each word extracted by the extraction module 402, and the acoustic feature set to be trained of each word, to obtain the prosody hierarchy marking model, where the prosody hierarchy marking model is used to mark a prosody hierarchy structure.
In the embodiments of the present application, a method of model training is provided. First, text data to be trained and audio data to be trained are obtained, where there is a correspondence between the text data to be trained and the audio data to be trained, and each word corresponds to a word identifier. Then the text feature set to be trained of each word is extracted from the text data to be trained, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type, and the acoustic feature set to be trained of each word is extracted from the audio data to be trained, where the acoustic feature set to be trained includes final syllable duration, post-word pause duration, final syllable acoustic statistical features, and inter-word acoustic feature change values. Finally, training is performed on the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word, to obtain the prosody hierarchy marking model. In this manner, the prosody hierarchy marking model is built jointly on text features and acoustic features, which provides richer features for the prosody hierarchy marking task; using a more accurate prosody hierarchy marking model improves the accuracy of prosody hierarchy marking and the quality of speech synthesis.
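A minimal sketch of how the per-word training samples described above might be assembled, assuming the feature extraction has already produced per-word dictionaries; all names are illustrative:

```python
def build_samples(words, word_ids, text_feats, acoustic_feats):
    """Pair each word's identifier with its text feature set (POS, word
    length, post-word punctuation) and its acoustic feature set, yielding
    the per-word samples on which the prosody model is trained."""
    return [
        {
            "word_id": word_ids[w],
            "text": text_feats[w],
            "acoustic": acoustic_feats[w],
        }
        for w in words
    ]
```

Each sample thus carries all three inputs the training module consumes: the word identifier, the text features, and the acoustic features.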
Optionally, on the basis of the embodiment corresponding to Figure 12 above, please refer to Figure 13. In another embodiment of the model training apparatus 40 provided by the embodiments of the present application, the model training apparatus 40 further includes a processing module 404 and a generation module 405;
the processing module 404 is configured to, after the obtaining module 401 obtains the text data to be trained and the audio data to be trained, perform word segmentation on the text data to be trained to obtain at least one word;
the obtaining module 401 is further configured to obtain a target word identifier corresponding to a target word according to a preset word identifier relationship, where the preset word identifier relationship indicates the relationship between each preset word and its word identifier, and the target word is any one of the at least one word obtained by the processing module;
the generation module 405 is configured to generate a target word vector corresponding to the target word in the text data to be trained;
the training module 403 is specifically configured to train on the target word identifier obtained by the obtaining module 401 and the target word vector generated by the generation module 405, to obtain a first model parameter, where the first model parameter is used to generate the word embedding layer in the prosody hierarchy marking model.
Secondly, the embodiments of the present application describe a method of training the word embedding layer. First, word segmentation is performed on the text data to be trained; then the target word identifier corresponding to the target word is obtained according to the preset word identifier relationship, the target word vector corresponding to the target word in the text data to be trained is generated, and the target word identifier and the target word vector are trained to obtain the first model parameter, where the first model parameter is used to generate the word embedding layer in the prosody hierarchy marking model. In this manner, the word embedding layer in the prosody hierarchy marking model can be trained directly, and the other neural networks in the prosody hierarchy marking model can be trained at the same time as the word embedding layer, which avoids the extra step of training a separate word vector model with an independent neural network and improves training efficiency.
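The point of training the embedding jointly is that its parameters are updated by gradients flowing back from the downstream prosody model, rather than by a separate word2vec-style pass. A toy sketch of such a lookup table (dimensions, initialization, and update rule are all illustrative assumptions):

```python
import random

class WordEmbedding:
    """Toy word-embedding layer trained jointly with the rest of the model.
    A sketch only: real systems would use a framework's embedding module."""

    def __init__(self, vocab_size, dim, seed=0):
        rng = random.Random(seed)
        # one row of `dim` weights per word identifier
        self.table = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                      for _ in range(vocab_size)]

    def lookup(self, word_id):
        return self.table[word_id]

    def update(self, word_id, grad, lr=0.01):
        # the gradient arrives from the downstream prosody network,
        # so no separate word-vector pre-training step is needed
        row = self.table[word_id]
        for i, g in enumerate(grad):
            row[i] -= lr * g
```

Only the row of the word actually seen in a sample is touched by each update, which is what makes joint training of the embedding cheap.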
Optionally, on the basis of the embodiment corresponding to Figure 12 above, in another embodiment of the model training apparatus 40 provided by the embodiments of the present application,
the extraction module 402 is specifically configured to obtain the part of speech, word length, and post-word punctuation type of the target word in the text data to be trained, where the part of speech indicates the syntactic category of the word, the word length indicates the number of characters in the word, and the post-word punctuation type indicates the type of punctuation following the word;
and to obtain the part of speech, word length, and post-word punctuation type of related words in the text data to be trained, where a related word is a word that has an association relationship with the target word;
the training module 403 is specifically configured to train on the part of speech, word length, and post-word punctuation type of the target word, together with the part of speech, word length, and post-word punctuation type of the related words, to obtain a second model parameter, where the second model parameter is used to generate the text neural network in the prosody hierarchy marking model.
Secondly, the embodiments of the present application describe a method of training the text neural network. First, the part of speech, word length, and post-word punctuation type of the target word in the text data to be trained are obtained, together with the part of speech, word length, and post-word punctuation type of the related words. Then the features of the target word and of the related words are trained to obtain the second model parameter, where the second model parameter is used to generate the text neural network in the prosody hierarchy marking model. In this manner, the neural network allows the system to automatically learn, from the raw input text feature set, high-level feature representations that benefit prosody hierarchy structure marking, which improves the accuracy of the prosody hierarchy marking model.
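Before a network can consume them, the three text features of a word must be turned into a numeric vector. A minimal sketch of one plausible encoding (the tag and punctuation inventories below are illustrative assumptions, not from the source):

```python
POS_TAGS = ["n", "v", "a", "d", "p", "u"]            # illustrative POS tag set
PUNCT_TYPES = ["none", "comma", "period", "other"]   # post-word punctuation classes

def encode_text_features(pos, word_len, punct):
    """One-hot encode a word's part of speech and post-word punctuation
    type, then append its word length, producing the text feature vector
    fed to the text neural network."""
    vec = [1.0 if t == pos else 0.0 for t in POS_TAGS]
    vec += [1.0 if p == punct else 0.0 for p in PUNCT_TYPES]
    vec.append(float(word_len))
    return vec
```

The same encoding would be applied to the related words, and the resulting vectors concatenated as the network's input.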
Optionally, on the basis of the embodiment corresponding to Figure 12 above, please refer to Figure 14. In another embodiment of the model training apparatus 40 provided by the embodiments of the present application, the model training apparatus 40 further includes an alignment module 406;
the alignment module 406 is configured to, after the obtaining module 401 obtains the text data to be trained and the audio data to be trained, perform forced alignment between the text data to be trained and the audio data to be trained to obtain a time-aligned text;
the extraction module 402 is specifically configured to determine the final syllable duration of the target word according to the time-aligned text.
Secondly, in the embodiments of the present application, after the text data to be trained and the audio data to be trained are obtained, forced alignment is performed between the text data to be trained and the audio data to be trained to obtain a time-aligned text, and the final syllable duration of the target word is then determined from the time-aligned text. In this manner, the time-aligned text can be obtained and the final syllable duration extracted as one element of the acoustic feature set; high-level features that benefit the marking are learned automatically from the raw input acoustic feature set, which improves the accuracy of the prosody hierarchy marking model.
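Given forced-alignment output as (start, end) time intervals, the final syllable duration of a word can be read off directly. A sketch under the assumption that alignment yields word-level and syllable-level intervals in seconds (the interval format is an assumption; real aligners emit e.g. TextGrid tiers):

```python
def final_syllable_duration(word_intervals, syllable_intervals, word):
    """Return the duration of the last syllable aligned inside `word`.
    word_intervals: {word: (start, end)}; syllable_intervals: [(start, end)]."""
    w_start, w_end = word_intervals[word]
    last = None
    for s_start, s_end in syllable_intervals:
        if s_start >= w_start and s_end <= w_end:
            last = (s_start, s_end)  # keep the latest syllable inside the word
    if last is None:
        raise ValueError("no syllable aligned inside word")
    return last[1] - last[0]
```

Because the syllable intervals are time-ordered, the last one falling inside the word's span is its final syllable.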
Optionally, on the basis of the embodiment corresponding to Figure 14 above, in another embodiment of the model training apparatus 40 provided by the embodiments of the present application,
the extraction module 402 is specifically configured to determine the post-word pause duration of the target word according to the time-aligned text.
Thirdly, in the embodiments of the present application, after the text data to be trained and the audio data to be trained are obtained, forced alignment is performed between them to obtain a time-aligned text, and the post-word pause duration can then be determined according to the time-aligned text. In this manner, after the text data and the audio data are force-aligned, the post-word pause duration of each word can be determined and used as one element of the acoustic feature set; high-level features that benefit the marking are learned automatically from the raw input acoustic feature set, which improves the accuracy of the prosody hierarchy marking model.
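The post-word pause is simply the gap between a word's aligned end time and the next word's aligned start time. A sketch, assuming consecutive time-aligned (start, end) word intervals in seconds:

```python
def pauses_after_words(intervals):
    """Given time-aligned (start, end) intervals for consecutive words,
    return the pause duration after each word (0.0 after the last word
    or when the next word starts immediately)."""
    pauses = []
    for i, (_, end) in enumerate(intervals):
        nxt = intervals[i + 1][0] if i + 1 < len(intervals) else end
        pauses.append(max(0.0, nxt - end))
    return pauses
```

Long pauses tend to coincide with higher-level prosodic boundaries, which is why this value is informative for the marking model.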
Optionally, on the basis of the embodiment corresponding to Figure 14 above, in another embodiment of the model training apparatus 40 provided by the embodiments of the present application,
the extraction module 402 is specifically configured to calculate, according to the time-aligned text and the fundamental frequency information extracted from the audio data to be trained, the frame number of the voiced start frame and the frame number of the voiced end frame of the final syllable of the target word;
to extract the logarithmic fundamental frequency curve and the logarithmic energy curve of the audio data to be trained;
and to calculate the final syllable acoustic statistical features of the target word according to the frame number of the voiced start frame of the final syllable of the target word, the frame number of the voiced end frame, the logarithmic fundamental frequency curve, and the logarithmic energy curve, where the final syllable acoustic statistical features include at least one of the maximum, minimum, range, mean, and variance of the logarithmic fundamental frequency curve, and further include at least one of the maximum, minimum, range, mean, and variance of the logarithmic energy curve.
Thirdly, in the embodiments of the present application, after the text data to be trained and the audio data to be trained are obtained, forced alignment is performed between them to obtain a time-aligned text. Then, according to the time-aligned text and the fundamental frequency information extracted from the audio data to be trained, the frame numbers of the voiced start frame and the voiced end frame of the final syllable of the target word are calculated, and the logarithmic fundamental frequency curve and the logarithmic energy curve of the audio data to be trained are extracted. Finally, the final syllable acoustic statistical features of the target word are calculated according to the frame number of the voiced start frame, the frame number of the voiced end frame, the logarithmic fundamental frequency curve, and the logarithmic energy curve. In this manner, the time-aligned text data is obtained, and the frame numbers of the start and end of the word-final voiced segment can be obtained from the fundamental frequency information extracted from the audio; high-level features that benefit the marking are learned automatically from the raw input acoustic feature set, which improves the accuracy of the prosody hierarchy marking model.
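Once the voiced start and end frame numbers of the final syllable are known, the statistical features reduce to summary statistics over a slice of the log-F0 (or log-energy) curve. A minimal sketch (the curve is assumed to be a per-frame list of log values):

```python
def voiced_segment_stats(curve, start_frame, end_frame):
    """Max, min, range, mean, and variance of a log-F0 or log-energy
    curve over the voiced frames [start_frame, end_frame] of a word's
    final syllable."""
    seg = curve[start_frame:end_frame + 1]
    mean = sum(seg) / len(seg)
    var = sum((x - mean) ** 2 for x in seg) / len(seg)
    return {
        "max": max(seg),
        "min": min(seg),
        "range": max(seg) - min(seg),
        "mean": mean,
        "var": var,
    }
```

Applying this once to the log-F0 curve and once to the log-energy curve yields the final syllable acoustic statistical features described above.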
Optionally, on the basis of the embodiment corresponding to Figure 14 above, in another embodiment of the model training apparatus 40 provided by the embodiments of the present application,
the extraction module 402 is specifically configured to calculate, according to the time-aligned text and the fundamental frequency information extracted from the audio data to be trained, the frame number of the last voiced frame of the target word and the frame number of the word-initial voiced frame of the next adjacent word of the target word;
to determine, according to the frame number of the last voiced frame of the target word, the frame number of the word-initial voiced frame of the next adjacent word, and the frame-level fundamental frequency information and energy information extracted from the audio data to be trained, the fundamental frequency values and energy values of the word-final voiced frame of the target word and of the word-initial voiced frame of the next adjacent word;
and to calculate the logarithmic difference of the fundamental frequency values according to the fundamental frequency values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word, and the logarithmic difference of the energy values according to the energy values of the same two frames, where the logarithmic difference of the fundamental frequency values and the logarithmic difference of the energy values belong to the inter-word acoustic feature change values.
Thirdly, in the embodiments of the present application, after the text data to be trained and the audio data to be trained are obtained, forced alignment is performed between them to obtain a time-aligned text. Then, according to the frame number of the last voiced frame of the target word and the frame number of the voiced frame of the next adjacent word, together with the frame-level fundamental frequency and energy data extracted from the audio, the fundamental frequency value and energy value of the word-final voiced frame of the target word and of the word-initial voiced frame of the next word are determined, and the logarithmic difference of the two fundamental frequency values and the logarithmic difference of the two energy values are calculated as the inter-word acoustic feature change values. In this manner, high-level features that benefit the marking are learned automatically from the raw input acoustic feature set, which improves the accuracy of the prosody hierarchy marking model.
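The inter-word change values are just log differences between the two boundary frames. A sketch, assuming linear-scale F0 (Hz) and energy values for the target word's last voiced frame and the next word's first voiced frame:

```python
import math

def interword_change(f0_tail, energy_tail, f0_head, energy_head):
    """Inter-word acoustic feature change values: log differences of F0
    and energy between the last voiced frame of one word (tail) and the
    first voiced frame of the next word (head)."""
    d_f0 = math.log(f0_head) - math.log(f0_tail)
    d_energy = math.log(energy_head) - math.log(energy_tail)
    return d_f0, d_energy
```

A large drop in log F0 or log energy across a word boundary is a typical cue for a prosodic phrase break, which is what these two values let the model exploit.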
Optionally, on the basis of the embodiments corresponding to Figure 12, Figure 13, or Figure 14 above, in another embodiment of the model training apparatus 40 provided by the embodiments of the present application,
the training module 403 is specifically configured to obtain a first output result of the target word identifier through the word embedding layer in the prosody hierarchy marking model, where the target word identifier corresponds to the target word, the target word is any one of the at least one word, and the word embedding layer is obtained by training according to the first model parameter;
to obtain a second output result of the target text feature set to be trained through the text neural network in the prosody hierarchy marking model, where the target text feature set to be trained corresponds to the target word, and the text neural network is obtained by training according to the second model parameter;
to train on the first output result, the second output result, and the target acoustic feature set to be trained, to obtain a third model parameter, where the target acoustic feature set to be trained corresponds to the target word, and the third model parameter is used to generate the acoustic neural network in the prosody hierarchy marking model;
and to generate the prosody hierarchy marking model according to the first model parameter, the second model parameter, and the third model parameter.
Further, the embodiments of the present application describe a method of training to obtain the prosody hierarchy marking model: three classes of model parameters need to be trained, namely the first model parameter, the second model parameter, and the third model parameter, and the three are trained together as a whole to generate the prosody hierarchy marking model. In this manner, the three parts are stacked as neural networks to form one complete prosody hierarchy marking model and trained as a single whole; the training covers word identifiers and word vectors, sentence- and word-level text features, and audio and acoustic features, so richer features can be obtained, which helps improve the accuracy of the marking.
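The stacking described above can be sketched as a forward pass in which the embedding's output (first output result) and the text network's output (second output result) are concatenated with the acoustic features and fed to the acoustic network; all three callables below are placeholders for the trained sub-networks, not real implementations.

```python
def forward(word_id, text_feats, acoustic_feats,
            embed, text_net, acoustic_net):
    """Forward pass of the stacked prosody model (a sketch).
    embed/text_net/acoustic_net stand in for the sub-networks whose
    parameters are the first, second, and third model parameters."""
    e = embed(word_id)        # first output result (list of floats)
    t = text_net(text_feats)  # second output result (list of floats)
    # list concatenation plays the role of feature concatenation here
    return acoustic_net(e + t + acoustic_feats)
```

Because the three parts sit in one computation graph, one training pass updates all three parameter sets jointly, as the embodiment describes.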
The embodiments of the present application further provide another prosody hierarchy marking apparatus. As shown in Figure 15, for ease of description only the parts relevant to the embodiments of the present application are shown; for specific technical details not disclosed, please refer to the method parts of the embodiments of the present application. The terminal device may be any terminal device, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point-of-sale (POS) terminal, an in-vehicle computer, and the like. The following takes a mobile phone as the terminal device:
Figure 15 shows a block diagram of a partial structure of a mobile phone related to the terminal device provided by the embodiments of the present application. Referring to Figure 15, the mobile phone includes components such as a radio frequency (RF) circuit 510, a memory 520, an input unit 530, a display unit 540, a sensor 550, an audio circuit 560, a wireless fidelity (WiFi) module 570, a processor 580, and a power supply 590. Those skilled in the art will understand that the mobile phone structure shown in Figure 15 does not limit the mobile phone, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
Each component of the mobile phone is described in detail below with reference to Figure 15:
The RF circuit 510 can be used to receive and send signals during information transmission and reception or during a call; in particular, after receiving downlink information from a base station, it passes the information to the processor 580 for processing, and it sends uplink data to the base station. In general, the RF circuit 510 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 510 can also communicate with networks and other devices through wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to the global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), long term evolution (LTE), e-mail, short message service (SMS), and the like.
The memory 520 can be used to store software programs and modules, and the processor 580 runs the software programs and modules stored in the memory 520 to perform the various functional applications and data processing of the mobile phone. The memory 520 may mainly include a program storage area and a data storage area, where the program storage area can store an operating system, an application required by at least one function (such as a sound playback function, an image playback function, and the like), and so on; the data storage area can store data created according to the use of the mobile phone (such as audio data, a phone book, and the like). In addition, the memory 520 may include a high-speed random access memory, and may also include a nonvolatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 530 can be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the mobile phone. Specifically, the input unit 530 may include a touch panel 531 and other input devices 532. The touch panel 531, also called a touch screen, can collect the user's touch operations on or near it (such as operations performed on or near the touch panel 531 with a finger, a stylus, or any other suitable object or accessory) and drive a corresponding connection apparatus according to a preset program. Optionally, the touch panel 531 may include two parts: a touch detection apparatus and a touch controller. The touch detection apparatus detects the user's touch position, detects the signal produced by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection apparatus, converts it into touch point coordinates, sends the coordinates to the processor 580, and can receive and execute commands sent by the processor 580. Furthermore, the touch panel 531 can be implemented in multiple types, such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 531, the input unit 530 may also include other input devices 532. Specifically, the other input devices 532 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys and a power key), a trackball, a mouse, a joystick, and the like.
The display unit 540 can be used to display information input by the user or information provided to the user, as well as the various menus of the mobile phone. The display unit 540 may include a display panel 541, which optionally may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED) display, or the like. Further, the touch panel 531 can cover the display panel 541; when the touch panel 531 detects a touch operation on or near it, the operation is passed to the processor 580 to determine the type of the touch event, and the processor 580 then provides a corresponding visual output on the display panel 541 according to the type of the touch event. Although in Figure 15 the touch panel 531 and the display panel 541 are two independent components implementing the input and output functions of the mobile phone, in some embodiments the touch panel 531 and the display panel 541 can be integrated to implement the input and output functions of the mobile phone.
The mobile phone may also include at least one sensor 550, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor can adjust the brightness of the display panel 541 according to the ambient light, and the proximity sensor can turn off the display panel 541 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer can detect the magnitude of acceleration in all directions (generally along three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the phone's posture (such as landscape/portrait switching, related games, and magnetometer pose calibration) and for vibration-recognition related functions (such as a pedometer or tap detection). The mobile phone can also be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which are not described in detail here.
The audio circuit 560, a speaker 561, and a microphone 562 can provide an audio interface between the user and the mobile phone. The audio circuit 560 can transmit the electrical signal converted from received audio data to the speaker 561, which converts it into a sound signal for output; on the other hand, the microphone 562 converts a collected sound signal into an electrical signal, which the audio circuit 560 receives and converts into audio data. After the audio data is processed by the processor 580, it is sent through the RF circuit 510 to, for example, another mobile phone, or output to the memory 520 for further processing.
WiFi is a short-range wireless transmission technology. Through the WiFi module 570, the mobile phone can help the user send and receive e-mail, browse web pages, access streaming media, and the like; it provides the user with wireless broadband Internet access. Although Figure 15 shows the WiFi module 570, it can be understood that the module is not an essential component of the mobile phone and can be omitted as needed without changing the essence of the invention.
The processor 580 is the control center of the mobile phone, connecting all parts of the whole phone through various interfaces and lines. By running or executing the software programs and/or modules stored in the memory 520 and calling the data stored in the memory 520, it performs the various functions of the mobile phone and processes data, thereby monitoring the phone as a whole. Optionally, the processor 580 may include one or more processing units; optionally, the processor 580 may integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, applications, and the like, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 580.
The mobile phone further includes a power supply 590 (such as a battery) that supplies power to all the components. Optionally, the power supply can be logically connected to the processor 580 through a power management system, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management system.
Although not shown, the mobile phone may also include a camera, a Bluetooth module, and the like, which are not described in detail here.
In the embodiments of the present application, the processor 580 included in the terminal device also has the following functions:
obtaining text data to be marked and audio data, where there is a correspondence between the text data to be marked and the audio data, the text data to be marked includes at least one word, and each word corresponds to a word identifier;
extracting the text feature set to be marked of each word according to the text data to be marked, where the text feature set to be marked includes part of speech, word length, and post-word punctuation type;
extracting the acoustic feature set of each word according to the audio data, where the acoustic feature set includes final syllable duration, post-word pause duration, final syllable acoustic statistical features, and inter-word acoustic feature change values;
obtaining the prosody hierarchy structure through the prosody hierarchy marking model according to the word identifier of each word, the text feature set to be marked of each word, and the acoustic feature set of each word.
In the embodiments of the present application, the processor 580 included in the terminal device also has the following functions:
obtaining text data to be trained and audio data to be trained, where there is a correspondence between the text data to be trained and the audio data to be trained, the text data to be trained includes at least one word, and each word corresponds to a word identifier;
extracting the text feature set to be trained of each word according to the text data to be trained, where the text feature set to be trained includes part of speech, word length, and post-word punctuation type;
extracting the acoustic feature set to be trained of each word according to the audio data to be trained, where the acoustic feature set to be trained includes final syllable duration, post-word pause duration, final syllable acoustic statistical features, and inter-word acoustic feature change values;
training on the word identifier corresponding to each word, the text feature set to be trained of each word, and the acoustic feature set to be trained of each word, to obtain the prosody hierarchy marking model, where the prosody hierarchy marking model is used to mark a prosody hierarchy structure.
Figure 16 is a schematic diagram of a server structure provided by the embodiments of the present application. The server 600 may vary considerably depending on configuration or performance, and may include one or more central processing units (CPUs) 622 (for example, one or more processors), a memory 632, and one or more storage media 630 (such as one or more mass storage devices) that store application programs 642 or data 644. The memory 632 and the storage medium 630 may provide transient storage or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown), and each module may include a series of instruction operations on the server. Further, the central processing unit 622 can be configured to communicate with the storage medium 630 and to execute, on the server 600, the series of instruction operations in the storage medium 630.
The server 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, and/or one or more operating systems 641, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
The steps performed by the server in the above embodiments may be based on the server structure shown in Figure 16.
In the embodiment of the present application, the CPU 622 included in the server further has the following functions:
obtaining to-be-labeled text data and audio data, wherein there is a correspondence between the to-be-labeled text data and the audio data, the to-be-labeled text data includes at least one word, and each word corresponds to one word identifier;
extracting a to-be-labeled text feature set of each word according to the to-be-labeled text data, wherein the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
according to the word identifier of each word, the to-be-labeled text feature set of each word, and the acoustic feature set of each word, obtaining a prosody hierarchy structure through the prosody hierarchy labeling model.
In the embodiment of the present application, the CPU 622 included in the server further has the following functions:
obtaining to-be-trained text data and to-be-trained audio data, wherein there is a correspondence between the to-be-trained text data and the to-be-trained audio data, the to-be-trained text data includes at least one word, and each word corresponds to one word identifier;
extracting a to-be-trained text feature set of each word according to the to-be-trained text data, wherein the to-be-trained text feature set includes a part of speech, a word length, and a post-word punctuation type;
extracting a to-be-trained acoustic feature set of each word according to the to-be-trained audio data, wherein the to-be-trained acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
training the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure.
With the research and progress of artificial intelligence technology, artificial intelligence has been researched and applied in many fields, such as smart homes, intelligent wearable devices, virtual assistants, smart speakers, intelligent marketing, unmanned driving, autonomous driving, drones, robots, intelligent healthcare, and intelligent customer service. It is believed that, with the development of technology, artificial intelligence will be applied in more fields and play an increasingly important role.
It is apparent to those skilled in the art that, for convenience and brevity of description, for the specific working processes of the systems, devices, and units described above, reference may be made to the corresponding processes in the foregoing method embodiments, and details are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative. For example, the division of units is merely a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be implemented through some interfaces, and the indirect couplings or communication connections between devices or units may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of this application, in essence, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (read-only memory, ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disc.
The above embodiments are merely intended to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions described in the foregoing embodiments, or make equivalent replacements to some of the technical features therein, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.
Claims (15)
1. A method of prosody hierarchy labeling, comprising:
obtaining to-be-labeled text data and audio data, wherein there is a correspondence between the to-be-labeled text data and the audio data, the to-be-labeled text data includes at least one word, and each word corresponds to one word identifier;
extracting a to-be-labeled text feature set of each word according to the to-be-labeled text data, wherein the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type, and the audio data is voice data;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
according to the word identifier of each word, the to-be-labeled text feature set of each word, and the acoustic feature set of each word, obtaining a prosody hierarchy structure through a prosody hierarchy labeling model, wherein the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase.
2. A method of model training, comprising:
obtaining to-be-trained text data and to-be-trained audio data, wherein there is a correspondence between the to-be-trained text data and the to-be-trained audio data, the to-be-trained text data includes at least one word, each word corresponds to one word identifier, and the to-be-trained audio data is voice data;
extracting a to-be-trained text feature set of each word according to the to-be-trained text data, wherein the to-be-trained text feature set includes a part of speech, a word length, and a post-word punctuation type;
extracting a to-be-trained acoustic feature set of each word according to the to-be-trained audio data, wherein the to-be-trained acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
training the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure, and the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase.
3. The method according to claim 2, wherein after the obtaining of the to-be-trained text data and the to-be-trained audio data, the method further comprises:
performing word segmentation on the to-be-trained text data to obtain at least one word;
obtaining a target word identifier corresponding to a target word according to a preset word identifier relationship, wherein the preset word identifier relationship is used to indicate a relationship between each preset word and a word identifier, and the target word is any one of the at least one word;
generating a target word vector corresponding to the target word in the to-be-trained text data;
wherein the training of the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain the prosody hierarchy labeling model comprises:
training the target word identifier and the target word vector to obtain a first model parameter, wherein the first model parameter is used to generate a word embedding layer in the prosody hierarchy labeling model, and the word embedding layer is updated within a target time.
4. The method according to claim 2, wherein the extracting of the to-be-trained text feature set of each word according to the to-be-trained text data comprises:
obtaining a part of speech, a word length, and a post-word punctuation type of a target word in the to-be-trained text data, wherein the part of speech indicates a syntactic classification result of the word, the word length indicates the number of characters in the word, and the post-word punctuation type indicates the type of punctuation following the word;
obtaining a part of speech, a word length, and a post-word punctuation type of an associated word in the to-be-trained text data, wherein the associated word is a word having an association relationship with the target word;
wherein the training of the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain the prosody hierarchy labeling model comprises:
training, using a loss function, the part of speech, word length, and post-word punctuation type of the target word and the part of speech, word length, and post-word punctuation type of the associated word;
when the loss function reaches a minimum value, obtaining a second model parameter, wherein the second model parameter is used to generate a text neural network in the prosody hierarchy labeling model.
5. The method according to claim 2, wherein after the obtaining of the to-be-trained text data and the to-be-trained audio data, the method further comprises:
performing forced alignment on the to-be-trained text data and the to-be-trained audio data to obtain a time-aligned text;
wherein the extracting of the to-be-trained acoustic feature set of each word according to the to-be-trained audio data comprises:
determining the final syllable duration of the target word according to the time-aligned text.
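Assuming the forced alignment of claim 5 yields per-syllable time intervals for each word (the tuple format below is an assumption for illustration), the final syllable duration could be read off as follows:

```python
def final_syllable_duration(aligned_word):
    """aligned_word: list of (syllable, start_sec, end_sec) tuples for one word,
    in time order, as produced by forced alignment of text and audio.
    Returns the duration of the word's last syllable in seconds."""
    syllable, start, end = aligned_word[-1]  # the final syllable of the word
    return end - start

# Example: a two-syllable word aligned to 0.00-0.43 s of the audio
word = [("ni3", 0.00, 0.18), ("hao3", 0.18, 0.43)]
print(round(final_syllable_duration(word), 2))  # 0.25
```

The post-word pause duration of claim 6 could be obtained analogously, as the gap between the end time of one word and the start time of the next.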
6. The method according to claim 5, wherein the extracting of the to-be-trained acoustic feature set of each word according to the to-be-trained audio data comprises:
determining the post-word pause duration of the target word according to the time-aligned text.
7. The method according to claim 5, wherein the extracting of the to-be-trained acoustic feature set of each word according to the to-be-trained audio data comprises:
calculating, according to the time-aligned text and fundamental frequency information extracted from the to-be-trained audio data, the frame number of the voiced start frame and the frame number of the voiced end frame of the final syllable of the target word;
extracting a logarithmic fundamental frequency curve and a logarithmic energy curve of the to-be-trained audio data;
calculating the final syllable acoustic statistical features of the target word according to the frame number of the voiced start frame, the frame number of the voiced end frame, the logarithmic fundamental frequency curve, and the logarithmic energy curve, wherein the final syllable acoustic statistical features include at least one of the maximum value, minimum value, range, mean, and variance of the logarithmic fundamental frequency curve, and further include at least one of the maximum value, minimum value, range, mean, and variance of the logarithmic energy curve.
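The statistics named in claim 7 are standard descriptive statistics over the voiced frames of the final syllable. A minimal sketch, with a toy log-F0 curve standing in for the curves extracted from real audio:

```python
def syllable_stats(curve, start_frame, end_frame):
    """Max, min, range, mean, and variance of a log-domain curve
    (log-F0 or log-energy) over the voiced frames [start_frame, end_frame]
    of a final syllable."""
    seg = curve[start_frame:end_frame + 1]
    mean = sum(seg) / len(seg)
    var = sum((v - mean) ** 2 for v in seg) / len(seg)
    return {"max": max(seg), "min": min(seg),
            "range": max(seg) - min(seg), "mean": mean, "var": var}

# Toy log-F0 curve (one value per frame, 0.0 = unvoiced);
# frames 2..5 are the final syllable's voiced span
log_f0 = [0.0, 0.0, 5.0, 5.2, 5.1, 4.9, 0.0]
stats = syllable_stats(log_f0, 2, 5)
```

The same function applied to the logarithmic energy curve yields the second half of the final syllable acoustic statistical features.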
8. The method according to claim 5, wherein the extracting of the to-be-trained acoustic feature set of each word according to the to-be-trained audio data comprises:
calculating, according to the time-aligned text and fundamental frequency information extracted from the to-be-trained audio data, the frame number of the last voiced frame of the target word and the frame number of the first voiced frame of the next adjacent word following the target word;
determining, according to the frame number of the last voiced frame of the target word, the frame number of the first voiced frame of the next adjacent word, and fundamental frequency information and energy information extracted frame by frame from the to-be-trained audio data, the fundamental frequency values and energy values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word;
calculating a logarithmic difference of the fundamental frequency values according to the fundamental frequency values of the word-final voiced frame of the target word and the word-initial voiced frame of the next adjacent word, and calculating a logarithmic difference of the energy values according to the energy values of those two frames, wherein the logarithmic difference of the fundamental frequency values and the logarithmic difference of the energy values belong to the inter-word acoustic feature change value.
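The inter-word change values of claim 8 reduce to log differences between two frames. A sketch under the assumption that per-frame F0 and energy tracks are available and that unvoiced frames are marked with 0.0 (the toy values are for illustration only):

```python
import math

def interword_change(f0, energy, last_voiced, next_first_voiced):
    """Log differences of F0 and energy between the word-final voiced frame
    of one word and the word-initial voiced frame of the next adjacent word.
    f0, energy: per-frame linear-domain values; the two frame indices come
    from forced alignment plus voicing detection."""
    df0 = math.log(f0[next_first_voiced]) - math.log(f0[last_voiced])
    de = math.log(energy[next_first_voiced]) - math.log(energy[last_voiced])
    return df0, de

# Toy frame tracks: word A's last voiced frame is 3, word B's first voiced frame is 6
f0 = [200.0, 210.0, 205.0, 200.0, 0.0, 0.0, 100.0, 110.0]
energy = [1.0, 1.1, 1.0, 0.8, 0.0, 0.0, 0.4, 0.5]
df0, de = interword_change(f0, energy, 3, 6)
print(round(df0, 3))  # log(100/200) ≈ -0.693
```

A large negative log-F0 or log-energy drop across the word boundary is the kind of change value the model can associate with a prosodic boundary.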
9. The method according to any one of claims 2 to 8, wherein the training of the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain the prosody hierarchy labeling model comprises:
obtaining a first output result of a target word identifier through the word embedding layer in the prosody hierarchy labeling model, wherein the target word identifier corresponds to a target word, the target word is any one of the at least one word, and the word embedding layer is obtained by training according to a first model parameter;
obtaining a second output result of a target to-be-trained text feature set through the text neural network in the prosody hierarchy labeling model, wherein the target to-be-trained text feature set corresponds to the target word, and the text neural network is obtained by training according to a second model parameter;
training the first output result, the second output result, and a target to-be-trained acoustic feature set to obtain a third model parameter, wherein the target to-be-trained acoustic feature set corresponds to the target word, and the third model parameter is used to generate an acoustic neural network in the prosody hierarchy labeling model;
generating the prosody hierarchy labeling model according to the first model parameter, the second model parameter, and the third model parameter.
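Claim 9 fuses three streams per word: the word embedding output, the text network output, and the acoustic feature set. The sketch below illustrates only the fusion and labeling step, with randomly initialized values and a single linear layer standing in for the acoustic neural network; the dimensions and label set are assumptions, not the patent's architecture:

```python
import random

random.seed(0)

# Assumed toy dimensions and label set for illustration
EMB_DIM, TEXT_DIM, ACOUSTIC_DIM = 8, 4, 5
LABELS = ["none", "prosodic_word", "prosodic_phrase", "intonation_phrase"]

def label_word(first_out, second_out, acoustic_feats, weights, bias):
    """Concatenate the word-embedding output (first output result), the text
    network output (second output result), and the acoustic feature set, then
    score the prosody labels with a linear layer standing in for the acoustic
    neural network."""
    x = first_out + second_out + acoustic_feats  # feature concatenation
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b
              for row, b in zip(weights, bias)]
    return LABELS[scores.index(max(scores))]

dim = EMB_DIM + TEXT_DIM + ACOUSTIC_DIM
first_out = [random.gauss(0, 1) for _ in range(EMB_DIM)]    # from the word embedding layer
second_out = [random.gauss(0, 1) for _ in range(TEXT_DIM)]  # from the text neural network
acoustic = [random.gauss(0, 1) for _ in range(ACOUSTIC_DIM)]
W = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(len(LABELS))]
b = [0.0] * len(LABELS)
print(label_word(first_out, second_out, acoustic, W, b))
```

In training, the first and second model parameters (embedding layer and text network) are learned first, and the weights of this fusion step would be the third model parameter.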
10. A prosody hierarchy labeling device, comprising:
an obtaining module, configured to obtain to-be-labeled text data and audio data, wherein there is a correspondence between the to-be-labeled text data and the audio data, the to-be-labeled text data includes at least one word, each word corresponds to one word identifier, and the audio data is voice data;
an extraction module, configured to extract a to-be-labeled text feature set of each word according to the to-be-labeled text data obtained by the obtaining module, wherein the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type;
the extraction module being further configured to extract an acoustic feature set of each word according to the audio data obtained by the obtaining module, wherein the acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
a prediction module, configured to obtain a prosody hierarchy structure through a prosody hierarchy labeling model according to the word identifier of each word, the to-be-labeled text feature set of each word extracted by the extraction module, and the acoustic feature set of each word, wherein the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase.
11. A model training device, comprising:
an obtaining module, configured to obtain to-be-trained text data and to-be-trained audio data, wherein there is a correspondence between the to-be-trained text data and the to-be-trained audio data, the to-be-trained text data includes at least one word, each word corresponds to one word identifier, and the to-be-trained audio data is voice data;
an extraction module, configured to extract a to-be-trained text feature set of each word according to the to-be-trained text data obtained by the obtaining module, wherein the to-be-trained text feature set includes a part of speech, a word length, and a post-word punctuation type;
the extraction module being further configured to extract a to-be-trained acoustic feature set of each word according to the to-be-trained audio data obtained by the obtaining module, wherein the to-be-trained acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
a training module, configured to train the word identifier corresponding to each word, the to-be-trained text feature set of each word extracted by the extraction module, and the to-be-trained acoustic feature set of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure, and the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase.
12. A terminal device, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory, including the following steps:
obtaining to-be-labeled text data and audio data, wherein there is a correspondence between the to-be-labeled text data and the audio data, the to-be-labeled text data includes at least one word, and each word corresponds to one word identifier;
extracting a to-be-labeled text feature set of each word according to the to-be-labeled text data, wherein the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type, and the audio data is voice data;
extracting an acoustic feature set of each word according to the audio data, wherein the acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
according to the word identifier of each word, the to-be-labeled text feature set of each word, and the acoustic feature set of each word, obtaining a prosody hierarchy structure through a prosody hierarchy labeling model, wherein the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase;
and the bus system is configured to connect the memory and the processor so that the memory and the processor communicate with each other.
13. A server, comprising: a memory, a transceiver, a processor, and a bus system;
wherein the memory is configured to store a program;
the processor is configured to execute the program in the memory, including the following steps:
obtaining to-be-trained text data and to-be-trained audio data, wherein there is a correspondence between the to-be-trained text data and the to-be-trained audio data, the to-be-trained text data includes at least one word, each word corresponds to one word identifier, and the to-be-trained audio data is voice data;
extracting a to-be-trained text feature set of each word according to the to-be-trained text data, wherein the to-be-trained text feature set includes a part of speech, a word length, and a post-word punctuation type;
extracting a to-be-trained acoustic feature set of each word according to the to-be-trained audio data, wherein the to-be-trained acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value;
training the word identifier corresponding to each word, the to-be-trained text feature set of each word, and the to-be-trained acoustic feature set of each word to obtain a prosody hierarchy labeling model, wherein the prosody hierarchy labeling model is used for labeling a prosody hierarchy structure, and the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase;
and the bus system is configured to connect the memory and the processor so that the memory and the processor communicate with each other.
14. An intelligent speech interaction system, comprising a voice collection module, a speech processing and analysis module, and a storage module;
wherein the voice collection module is configured to obtain to-be-labeled text data and audio data, wherein there is a correspondence between the to-be-labeled text data and the audio data, the to-be-labeled text data includes at least one word, and each word corresponds to one word identifier;
the speech processing and analysis module is configured to: extract a to-be-labeled text feature set of each word according to the to-be-labeled text data, wherein the to-be-labeled text feature set includes a part of speech, a word length, and a post-word punctuation type, and the audio data is voice data;
extract an acoustic feature set of each word according to the audio data, wherein the acoustic feature set includes a final syllable duration, a post-word pause duration, final syllable acoustic statistical features, and an inter-word acoustic feature change value; and
according to the word identifier of each word, the to-be-labeled text feature set of each word, and the acoustic feature set of each word, obtain a prosody hierarchy structure through a prosody hierarchy labeling model, wherein the prosody hierarchy structure includes at least one of a prosodic word, a prosodic phrase, and an intonation phrase, or the prosody hierarchy structure includes at least one of a prosodic word and a prosodic phrase;
and the storage module is configured to store the prosody hierarchy structure.
15. A computer-readable storage medium comprising instructions that, when run on a computer, cause the computer to perform the method according to claim 1, or to perform the method according to any one of claims 2 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910751371.6A CN110444191B (en) | 2019-01-22 | 2019-01-22 | Rhythm level labeling method, model training method and device |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910751371.6A CN110444191B (en) | 2019-01-22 | 2019-01-22 | Rhythm level labeling method, model training method and device |
CN201910060152.3A CN109697973A (en) | 2019-01-22 | 2019-01-22 | A kind of method, the method and device of model training of prosody hierarchy mark |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910060152.3A Division CN109697973A (en) | 2019-01-22 | 2019-01-22 | A kind of method, the method and device of model training of prosody hierarchy mark |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110444191A true CN110444191A (en) | 2019-11-12 |
CN110444191B CN110444191B (en) | 2021-11-26 |
Family
ID=66234262
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910751371.6A Active CN110444191B (en) | 2019-01-22 | 2019-01-22 | Rhythm level labeling method, model training method and device |
CN201910060152.3A Pending CN109697973A (en) | 2019-01-22 | 2019-01-22 | A kind of method, the method and device of model training of prosody hierarchy mark |
Family Applications After (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910060152.3A Pending CN109697973A (en) | 2019-01-22 | 2019-01-22 | A kind of method, the method and device of model training of prosody hierarchy mark |
Country Status (1)
Country | Link |
---|---|
CN (2) | CN110444191B (en) |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111128120A (en) * | 2019-12-31 | 2020-05-08 | 苏州思必驰信息科技有限公司 | Text-to-speech method and device |
CN111164674A (en) * | 2019-12-31 | 2020-05-15 | 深圳市优必选科技股份有限公司 | Speech synthesis method, device, terminal and storage medium |
CN111261162A (en) * | 2020-03-09 | 2020-06-09 | 北京达佳互联信息技术有限公司 | Speech recognition method, speech recognition apparatus, and storage medium |
CN111710326A (en) * | 2020-06-12 | 2020-09-25 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111754978A (en) * | 2020-06-15 | 2020-10-09 | 北京百度网讯科技有限公司 | Rhythm hierarchy marking method, device, equipment and storage medium |
CN111785247A (en) * | 2020-07-13 | 2020-10-16 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112382270A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN112466277A (en) * | 2020-10-28 | 2021-03-09 | 北京百度网讯科技有限公司 | Rhythm model training method and device, electronic equipment and storage medium |
CN112863476A (en) * | 2019-11-27 | 2021-05-28 | 阿里巴巴集团控股有限公司 | Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing |
CN113129863A (en) * | 2019-12-31 | 2021-07-16 | 科大讯飞股份有限公司 | Voice time length prediction method, device, equipment and readable storage medium |
CN113421550A (en) * | 2021-06-25 | 2021-09-21 | 北京有竹居网络技术有限公司 | Speech synthesis method, device, readable medium and electronic equipment |
CN113421543A (en) * | 2021-06-30 | 2021-09-21 | 深圳追一科技有限公司 | Data labeling method, device and equipment and readable storage medium |
CN114420089A (en) * | 2022-03-30 | 2022-04-29 | 北京世纪好未来教育科技有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN115116428A (en) * | 2022-05-19 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Prosodic boundary labeling method, apparatus, device, medium, and program product |
CN115188365A (en) * | 2022-09-09 | 2022-10-14 | 中邮消费金融有限公司 | Pause prediction method and device, electronic equipment and storage medium |
CN113129863B (en) * | 2019-12-31 | 2024-05-31 | 科大讯飞股份有限公司 | Voice duration prediction method, device, equipment and readable storage medium |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020218635A1 (en) * | 2019-04-23 | 2020-10-29 | 엘지전자 주식회사 | Voice synthesis apparatus using artificial intelligence, method for operating voice synthesis apparatus, and computer-readable recording medium |
CN110164413B (en) * | 2019-05-13 | 2021-06-04 | 北京百度网讯科技有限公司 | Speech synthesis method, apparatus, computer device and storage medium |
CN110619035B (en) * | 2019-08-01 | 2023-07-25 | 平安科技(深圳)有限公司 | Method, device, equipment and storage medium for identifying keywords in interview video |
CN112528014B (en) * | 2019-08-30 | 2023-04-18 | 成都启英泰伦科技有限公司 | Method and device for predicting word segmentation, part of speech and rhythm of language text |
CN110556093B (en) * | 2019-09-17 | 2021-12-10 | 浙江同花顺智富软件有限公司 | Voice marking method and system |
CN110459202B (en) * | 2019-09-23 | 2022-03-15 | 浙江同花顺智能科技有限公司 | Rhythm labeling method, device, equipment and medium |
CN110675896B (en) * | 2019-09-30 | 2021-10-22 | 北京字节跳动网络技术有限公司 | Character time alignment method, device and medium for audio and electronic equipment |
CN110797005B (en) * | 2019-11-05 | 2022-06-10 | 百度在线网络技术(北京)有限公司 | Prosody prediction method, apparatus, device, and medium |
CN110767213A (en) * | 2019-11-08 | 2020-02-07 | 四川长虹电器股份有限公司 | Rhythm prediction method and device |
WO2021134581A1 (en) * | 2019-12-31 | 2021-07-08 | 深圳市优必选科技股份有限公司 | Prosodic feature prediction-based speech synthesis method, apparatus, terminal, and medium |
CN111369971B (en) * | 2020-03-11 | 2023-08-04 | 北京字节跳动网络技术有限公司 | Speech synthesis method, device, storage medium and electronic equipment |
CN111681641B (en) * | 2020-05-26 | 2024-02-06 | 微软技术许可有限责任公司 | Phrase-based end-to-end text-to-speech (TTS) synthesis |
CN111667816B (en) * | 2020-06-15 | 2024-01-23 | 北京百度网讯科技有限公司 | Model training method, speech synthesis method, device, equipment and storage medium |
CN114064964A (en) * | 2020-07-30 | 2022-02-18 | 华为技术有限公司 | Text time labeling method and device, electronic equipment and readable storage medium |
CN112102847B (en) * | 2020-09-09 | 2022-08-09 | 四川大学 | Audio and slide content alignment method |
CN112216267A (en) * | 2020-09-15 | 2021-01-12 | 北京捷通华声科技股份有限公司 | Rhythm prediction method, device, equipment and storage medium |
CN112863484B (en) * | 2021-01-25 | 2024-04-09 | 中国科学技术大学 | Prosodic phrase boundary prediction model training method and prosodic phrase boundary prediction method |
CN113178188B (en) * | 2021-04-26 | 2024-05-28 | 平安科技(深圳)有限公司 | Speech synthesis method, device, equipment and storage medium |
CN113327615B (en) * | 2021-08-02 | 2021-11-16 | 北京世纪好未来教育科技有限公司 | Voice evaluation method, device, equipment and storage medium |
CN115116427B (en) * | 2022-06-22 | 2023-11-14 | 马上消费金融股份有限公司 | Labeling method, voice synthesis method, training method and training device |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070112570A1 (en) * | 2005-11-17 | 2007-05-17 | Oki Electric Industry Co., Ltd. | Voice synthesizer, voice synthesizing method, and computer program |
US8554566B2 (en) * | 2008-08-12 | 2013-10-08 | Morphism Llc | Training and applying prosody models |
TW201432668A (en) * | 2013-02-05 | 2014-08-16 | Univ Nat Chiao Tung | Streaming encoder, prosody information encoding device, prosody-analyzing device, and device and method for speech-synthesizing |
CN105374350A (en) * | 2015-09-29 | 2016-03-02 | 百度在线网络技术(北京)有限公司 | Speech marking method and device |
CN105551481A (en) * | 2015-12-21 | 2016-05-04 | 百度在线网络技术(北京)有限公司 | Rhythm marking method of voice data and apparatus thereof |
US20160365087A1 (en) * | 2015-06-12 | 2016-12-15 | Geulah Holdings Llc | High end speech synthesis |
CN106601228A (en) * | 2016-12-09 | 2017-04-26 | 百度在线网络技术(北京)有限公司 | Sample marking method and device based on artificial intelligence prosody prediction |
CN106971709A (en) * | 2017-04-19 | 2017-07-21 | 腾讯科技(上海)有限公司 | Statistical parameter model establishing method and device, and speech synthesis method and device |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7136816B1 (en) * | 2002-04-05 | 2006-11-14 | At&T Corp. | System and method for predicting prosodic parameters |
US20070055526A1 (en) * | 2005-08-25 | 2007-03-08 | International Business Machines Corporation | Method, apparatus and computer program product providing prosodic-categorical enhancement to phrase-spliced text-to-speech synthesis |
CN103035241A (en) * | 2012-12-07 | 2013-04-10 | 中国科学院自动化研究所 | Model-complementary Chinese prosodic break recognition system and method |
CN105185373B (en) * | 2015-08-06 | 2017-04-05 | 百度在线网络技术(北京)有限公司 | Prosody hierarchy prediction model generation method, and prosody hierarchy prediction method and device |
CN105244020B (en) * | 2015-09-24 | 2017-03-22 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy model training method, text-to-speech method and text-to-speech device |
CN105185372B (en) * | 2015-10-20 | 2017-03-22 | 百度在线网络技术(北京)有限公司 | Training method for multiple personalized acoustic models, and voice synthesis method and voice synthesis device |
CN107039034B (en) * | 2016-02-04 | 2020-05-01 | 科大讯飞股份有限公司 | Rhythm prediction method and system |
CN108305612B (en) * | 2017-11-21 | 2020-07-31 | 腾讯科技(深圳)有限公司 | Text processing method, text processing device, model training method, model training device, storage medium and computer equipment |
2019
- 2019-01-22 CN CN201910751371.6A patent/CN110444191B/en active Active
- 2019-01-22 CN CN201910060152.3A patent/CN109697973A/en active Pending
Non-Patent Citations (4)
Title |
---|
CHONGJIA NI ET AL: "Automatic Prosody Boundary Labeling of Mandarin Using Both Text and Acoustic Information", 2008 6th International Symposium on Chinese Spoken Language Processing * |
MU WANG ET AL: "Emphatic Speech Synthesis and Control Based on Characteristic Transferring in End-to-End Speech Synthesis", 2018 First Asian Conference on Affective Computing and Intelligent Interaction (ACII Asia) * |
LI XIAO: "Chinese Prosodic Phrase Boundary Prediction Based on Text and Speech Features", China Master's Theses Full-text Database, Information Science and Technology Series * |
SU DAN: "Research on Chinese Prosodic Phrase Prediction Based on Semi-supervised Learning", China Master's Theses Full-text Database, Information Science and Technology Series * |
Cited By (24)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112863476A (en) * | 2019-11-27 | 2021-05-28 | 阿里巴巴集团控股有限公司 | Method and device for constructing personalized speech synthesis model, method and device for speech synthesis and testing |
CN113129863A (en) * | 2019-12-31 | 2021-07-16 | 科大讯飞股份有限公司 | Voice time length prediction method, device, equipment and readable storage medium |
CN111164674B (en) * | 2019-12-31 | 2024-05-03 | 深圳市优必选科技股份有限公司 | Speech synthesis method, device, terminal and storage medium |
CN113129863B (en) * | 2019-12-31 | 2024-05-31 | 科大讯飞股份有限公司 | Voice duration prediction method, device, equipment and readable storage medium |
CN111164674A (en) * | 2019-12-31 | 2020-05-15 | 深圳市优必选科技股份有限公司 | Speech synthesis method, device, terminal and storage medium |
CN111128120A (en) * | 2019-12-31 | 2020-05-08 | 苏州思必驰信息科技有限公司 | Text-to-speech method and device |
CN111261162B (en) * | 2020-03-09 | 2023-04-18 | 北京达佳互联信息技术有限公司 | Speech recognition method, speech recognition apparatus, and storage medium |
CN111261162A (en) * | 2020-03-09 | 2020-06-09 | 北京达佳互联信息技术有限公司 | Speech recognition method, speech recognition apparatus, and storage medium |
CN111710326A (en) * | 2020-06-12 | 2020-09-25 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111710326B (en) * | 2020-06-12 | 2024-01-23 | 携程计算机技术(上海)有限公司 | English voice synthesis method and system, electronic equipment and storage medium |
CN111754978A (en) * | 2020-06-15 | 2020-10-09 | 北京百度网讯科技有限公司 | Rhythm hierarchy marking method, device, equipment and storage medium |
CN111754978B (en) * | 2020-06-15 | 2023-04-18 | 北京百度网讯科技有限公司 | Prosodic hierarchy labeling method, device, equipment and storage medium |
CN111785247A (en) * | 2020-07-13 | 2020-10-16 | 北京字节跳动网络技术有限公司 | Voice generation method, device, equipment and computer readable medium |
CN112466277A (en) * | 2020-10-28 | 2021-03-09 | 北京百度网讯科技有限公司 | Rhythm model training method and device, electronic equipment and storage medium |
CN112466277B (en) * | 2020-10-28 | 2023-10-20 | 北京百度网讯科技有限公司 | Prosody model training method and device, electronic equipment and storage medium |
CN112382270A (en) * | 2020-11-13 | 2021-02-19 | 北京有竹居网络技术有限公司 | Speech synthesis method, apparatus, device and storage medium |
CN113421550A (en) * | 2021-06-25 | 2021-09-21 | 北京有竹居网络技术有限公司 | Speech synthesis method, device, readable medium and electronic equipment |
CN113421543A (en) * | 2021-06-30 | 2021-09-21 | 深圳追一科技有限公司 | Data labeling method, device and equipment and readable storage medium |
CN113421543B (en) * | 2021-06-30 | 2024-05-24 | 深圳追一科技有限公司 | Data labeling method, device, equipment and readable storage medium |
CN114420089B (en) * | 2022-03-30 | 2022-06-21 | 北京世纪好未来教育科技有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN114420089A (en) * | 2022-03-30 | 2022-04-29 | 北京世纪好未来教育科技有限公司 | Speech synthesis method, apparatus and computer-readable storage medium |
CN115116428A (en) * | 2022-05-19 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Prosodic boundary labeling method, apparatus, device, medium, and program product |
CN115116428B (en) * | 2022-05-19 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Prosodic boundary labeling method, device, equipment, medium and program product |
CN115188365A (en) * | 2022-09-09 | 2022-10-14 | 中邮消费金融有限公司 | Pause prediction method and device, electronic equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109697973A (en) | 2019-04-30 |
CN110444191B (en) | 2021-11-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110444191A (en) | A kind of method, the method and device of model training of prosody hierarchy mark | |
CN110838286B (en) | Model training method, language identification method, device and equipment | |
CN110491382B (en) | Speech recognition method and device based on artificial intelligence and speech interaction equipment | |
CN110490213B (en) | Image recognition method, device and storage medium | |
WO2018153213A1 (en) | Multi-language hybrid speech recognition method | |
CN110288077A (en) | Method and related apparatus for synthesizing speaking expressions based on artificial intelligence | |
CN110853617B (en) | Model training method, language identification method, device and equipment | |
WO2021135611A1 (en) | Method and device for speech recognition, terminal and storage medium | |
CN109145303A (en) | Name entity recognition method, device, medium and equipment | |
CN107943860A (en) | Model training method, and text intent recognition method and device |
CN110853618A (en) | Language identification method, model training method, device and equipment | |
CN110444210A (en) | Speech recognition method, and wake-up word detection method and device |
CN109523989A (en) | Speech synthesis method, speech synthesis device, storage medium and electronic equipment |
CN110890093A (en) | Intelligent device awakening method and device based on artificial intelligence | |
CN110517664A (en) | Multi-party speech recognition method, apparatus, device and readable storage medium |
US11705105B2 (en) | Speech synthesizer for evaluating quality of synthesized speech using artificial intelligence and method of operating the same | |
CN110334197A (en) | Corpus processing method and relevant apparatus | |
WO2020098269A1 (en) | Speech synthesis method and speech synthesis device | |
CN106649253B (en) | Auxiliary control method and system based on post-verification |
KR101627428B1 (en) | Method for establishing syntactic analysis model using deep learning and apparatus for perforing the method | |
US11568853B2 (en) | Voice recognition method using artificial intelligence and apparatus thereof | |
CN112735418B (en) | Voice interaction processing method, device, terminal and storage medium | |
CN111597804B (en) | Method and related device for training entity recognition model | |
CN113393828A (en) | Training method of voice synthesis model, and voice synthesis method and device | |
CN108345612A (en) | Question processing method and device, and device for question processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||