CN110534087A - Text prosody hierarchy structure prediction method, apparatus, device and storage medium - Google Patents
Text prosody hierarchy structure prediction method, apparatus, device and storage medium
- Publication number
- CN110534087A CN110534087A CN201910834143.5A CN201910834143A CN110534087A CN 110534087 A CN110534087 A CN 110534087A CN 201910834143 A CN201910834143 A CN 201910834143A CN 110534087 A CN110534087 A CN 110534087A
- Authority
- CN
- China
- Prior art keywords
- word
- hierarchy structure
- prosody hierarchy
- sequence
- structure prediction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Abstract
The embodiments of the present application disclose a prosody hierarchy structure prediction method, apparatus, device, and storage medium based on artificial intelligence. The method comprises: obtaining a target text; performing word segmentation and part-of-speech tagging on the target text to obtain a segmentation-and-tagging sequence; performing word-level feature extraction according to the segmentation-and-tagging sequence to obtain a word-level feature sequence, in which the word-level features of each word include at least a word vector obtained through semantic feature extraction; and obtaining, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism. The method can effectively improve the prediction accuracy for prosody hierarchy structures.
Description
Technical field
This application relates to the field of speech technology, and in particular to a text prosody hierarchy structure prediction method, apparatus, device, and storage medium based on artificial intelligence and the self-attention mechanism.
Background technique
The prosody hierarchy structure models prosodic features of speech such as pauses and rhythm. Prosody structure prediction is a front-end text-processing task in speech synthesis: it determines, from textual features, the prosodic structure type of each word in a sentence. Prosody structure prediction is of great significance to the naturalness of the speech synthesized by a text-to-speech system. At present, prosody structure prediction is mainly modeled with conditional random fields (CRF) or recurrent neural networks (RNN), but in practice the modeling performance of both schemes is limited, which in turn limits the quality of speech synthesis.
Summary of the invention
The embodiments of the present application provide a text prosody hierarchy structure prediction method, apparatus, device, and storage medium based on artificial intelligence, which can effectively improve the prediction accuracy for prosody hierarchy structures.
In view of this, a first aspect of the application provides a text prosody hierarchy structure prediction method based on artificial intelligence, comprising:
obtaining a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a segmentation-and-tagging sequence;
performing word-level feature extraction according to the segmentation-and-tagging sequence to obtain a word-level feature sequence, wherein the word-level features of each word in the word-level feature sequence include at least a word vector obtained through semantic feature extraction;
obtaining, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
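The four claimed steps can be sketched as a data-flow skeleton. Every helper passed in below is a hypothetical stub standing in for components described in the embodiments, not the patent's actual implementation:

```python
# Schematic sketch of the claimed method; all helpers are hypothetical stubs.
def predict_prosody_hierarchy(target_text, segment_and_tag, extract_features,
                              prediction_model):
    tagged = segment_and_tag(target_text)          # segmentation + POS tagging
    features = [extract_features(w, i, tagged)     # word-level feature sequence
                for i, w in enumerate(tagged)]
    return prediction_model(features)              # self-attention-based model

# Toy stand-ins just to show the data flow:
labels = predict_prosody_hierarchy(
    "hello world",
    lambda t: [(w, "x") for w in t.split()],
    lambda w, i, seq: [float(i)],
    lambda feats: ["PW" for _ in feats],           # one prosody label per word
)
print(labels)  # ['PW', 'PW']
```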
A second aspect of the application provides a text prosody hierarchy structure prediction apparatus based on artificial intelligence, comprising:
an obtaining module, configured to obtain a target text;
a segmentation and part-of-speech tagging module, configured to perform word segmentation and part-of-speech tagging on the target text to obtain a segmentation-and-tagging sequence;
a word-level feature extraction module, configured to perform word-level feature extraction according to the segmentation-and-tagging sequence to obtain a word-level feature sequence, wherein the word-level features of each word in the word-level feature sequence include at least a word vector obtained through semantic feature extraction;
a prosody hierarchy structure prediction module, configured to obtain, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
A third aspect of the application provides a text prosody hierarchy structure prediction device based on artificial intelligence, the device comprising a processor and a memory:
the memory is configured to store a computer program;
the processor is configured to execute, according to the computer program, the text prosody hierarchy structure prediction method described in the first aspect.
A fourth aspect of the application provides a computer-readable storage medium configured to store a computer program, the computer program being used to execute the text prosody hierarchy structure prediction method described in the first aspect.
A fifth aspect of the application provides a computer program product comprising instructions which, when run on a computer, cause the computer to execute the text prosody hierarchy structure prediction method described in the first aspect.
As can be seen from the above technical solutions, the embodiments of the present application have the following advantage:
The embodiments of the present application provide a text prosody hierarchy structure prediction method that uses a deep neural network model based on the self-attention mechanism to predict the prosody hierarchy structure, effectively improving the prediction accuracy for prosody hierarchy structures. Specifically, in the prosody hierarchy structure prediction method provided by the embodiments of the present application, after a target text is obtained, word segmentation and part-of-speech tagging are performed on the target text to obtain a segmentation-and-tagging sequence; word-level feature extraction is then performed according to the segmentation-and-tagging sequence to obtain a word-level feature sequence, in which the word-level features of each word include at least a word vector obtained through semantic feature extraction; finally, the prosody hierarchy structure sequence corresponding to the word-level feature sequence is obtained through a prosody hierarchy structure prediction model, which is a deep neural network model based on the self-attention mechanism. Because a deep neural network model based on the self-attention mechanism can capture the contextual dependencies between all the words within the full sentence, it has better sequence modeling ability than the CRF and RNN models of the related art; therefore, the above prediction method can effectively improve the prediction effect for prosody hierarchy structures and correspondingly improve the quality of speech synthesis.
Description of the drawings
Fig. 1 is a schematic diagram of an application scenario of the prosody hierarchy structure prediction method based on artificial intelligence provided by the embodiments of the present application;
Fig. 2 is a schematic flowchart of a prosody hierarchy structure prediction method based on artificial intelligence provided by the embodiments of the present application;
Fig. 3 is a schematic diagram of the working architecture of the prosody hierarchy structure prediction model provided by the embodiments of the present application;
Fig. 4 is a schematic flowchart of a training method for the prosody hierarchy structure prediction model provided by the embodiments of the present application;
Fig. 5 is a schematic diagram of similarity computation with scaled dot-product attention provided by the embodiments of the present application;
Fig. 6 is a schematic diagram of the computation flow of the multi-head attention mechanism provided by the embodiments of the present application;
Fig. 7 is a schematic diagram of the operation of the fully connected network sublayer provided by the embodiments of the present application;
Fig. 8 is a schematic diagram of the residual connection provided by the embodiments of the present application;
Fig. 9 is a schematic flowchart of another prosody hierarchy structure prediction method provided by the embodiments of the present application;
Figs. 10 to 14 are schematic structural diagrams of the first to fifth prosody hierarchy structure prediction apparatuses based on artificial intelligence provided by the embodiments of the present application, respectively;
Fig. 15 is a schematic structural diagram of a server for predicting prosody hierarchy structures based on artificial intelligence provided by the embodiments of the present application;
Fig. 16 is a schematic structural diagram of a terminal device for predicting prosody hierarchy structures based on artificial intelligence provided by the embodiments of the present application.
Detailed description of embodiments
In order to enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
The terms "first", "second", "third", "fourth", etc. (if present) in the description, claims, and drawings of the present application are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments described herein can be implemented in an order other than the one illustrated or described here. In addition, the terms "comprising" and "having" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
This application relates to the field of artificial intelligence (AI); the related technologies in this field are briefly introduced below.
Artificial intelligence is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence.
Artificial intelligence technology is an interdisciplinary subject covering a wide range of fields, involving both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics, and the like. AI software technology mainly includes several general directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
Among these, natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistic research. Natural language processing technology generally includes text processing, semantic understanding, machine translation, question answering, knowledge graphs, and other technologies.
In the related art, CRF or RNN models are generally used for modeling, and on this basis the prosody hierarchy structure of each word in a text is predicted. However, CRF and RNN models usually cannot capture the dependency between any two words within the full sentence, so their modeling ability is limited, and as a result the prosody hierarchy structure cannot be predicted accurately based on them.
In view of these problems in the related art, the embodiments of the present application provide a prosody hierarchy structure prediction method based on AI. The method uses a deep neural network model based on the self-attention mechanism to predict the prosody hierarchy structure of each word in a target text. Through its self-attention sublayers, this deep neural network model can better capture the contextual dependencies within the full sentence; compared with the CRF and RNN models of the related art, it has better sequence modeling ability and can accordingly achieve a better prosody hierarchy structure prediction effect, which in turn helps to improve the quality of subsequent speech synthesis.
It should be understood that the prosody hierarchy structure prediction method based on AI provided by the embodiments of the present application can be applied to a device with data processing capability, such as a terminal device or a server. The terminal device may specifically be a smartphone, computer, personal digital assistant (PDA), tablet computer, or the like; the server may specifically be an application server or a Web server, and in actual deployment it may be a standalone server or a server cluster.
To facilitate understanding of the technical solution provided by the embodiments of the present application, an application scenario in which the prosody hierarchy structure prediction method is applied to a server is introduced below as an example.
Referring to Fig. 1, Fig. 1 is a schematic diagram of an application scenario of the prosody hierarchy structure prediction method based on AI provided by the embodiments of the present application. As shown in Fig. 1, the application scenario includes a terminal device 110 and a server 120, which communicate over a network. The terminal device 110 is configured to receive a voice signal input by a user and transmit the voice signal to the server 120. The server 120 is configured to determine the reply text corresponding to the voice signal transmitted by the terminal device 110, execute the prosody hierarchy structure prediction method provided by the embodiments of the present application to predict the prosody hierarchy structure of each word in the reply text and generate the prosody hierarchy structure sequence corresponding to the reply text, and then convert the reply text into the corresponding reply voice signal according to the prosody hierarchy structure sequence and transmit it to the terminal device 110.
In a specific application, the user can input a voice signal to the terminal device 110, requesting the terminal device 110 to reply with a corresponding voice signal; after receiving the voice signal input by the user, the terminal device 110 transmits the voice signal to the server 120 through the network.
After the server 120 receives the voice signal transmitted by the terminal device 110, it first determines the reply text for replying to the voice signal. Then, taking the reply text as the target text, it performs word segmentation and part-of-speech tagging on the target text to obtain the corresponding segmentation-and-tagging sequence; next, it performs word-level feature extraction on the segmentation-and-tagging sequence to obtain a word-level feature sequence, in which the word-level features of each word include at least a word vector obtained through semantic feature extraction; then, it determines, through the prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
After determining the prosody hierarchy structure sequence corresponding to the reply text through the above processing, the server 120 can further generate, based on the prosody hierarchy structure sequence, the reply voice signal corresponding to the reply text; this reply voice signal is more natural and closer to human pronunciation. Finally, the server 120 transmits the generated reply voice signal to the terminal device 110, and the terminal device 110 plays the reply voice signal, realizing human-computer interaction with the user.
It should be noted that the human-computer interaction application scenario shown in Fig. 1 is merely an example. In practical applications, the prosody hierarchy structure prediction method provided by the embodiments of the present application may also be applied to other scenarios, for example, a scenario of converting text uploaded by a user into speech. The application scenarios to which the prosody hierarchy structure prediction method is applicable are not limited here.
The prosody hierarchy structure prediction method based on AI provided by the present application is introduced below through embodiments.
Referring to Fig. 2, Fig. 2 is a schematic flowchart of a prosody hierarchy structure prediction method based on AI provided by the embodiments of the present application. For ease of description, the following embodiments take a server as the execution subject. As shown in Fig. 2, the prosody hierarchy structure prediction method includes the following steps:
Step 201: Obtain a target text.
When the server needs to synthesize the voice signal corresponding to a target text, in order to obtain a voice signal that is more natural and closer to human pronunciation, the server can first predict the prosody hierarchy structure sequence corresponding to the target text, and then synthesize the voice signal corresponding to the target text based on that sequence.
It should be noted that the server can obtain the target text in different ways under different application scenarios. Taking the human-computer interaction scenario as an example, the server can obtain the voice signal sent by the terminal device and use the text obtained by converting that voice signal as the target text. In the scenario of converting text to speech, the server can use the text to be converted that is sent by the terminal device as the target text, or use text to be converted that it obtains from other servers or databases as the target text; and so on. The way in which the server obtains the target text is not limited here.
It should be understood that, when the prosody hierarchy structure prediction method provided by the embodiments of the present application is applied to a terminal device, the terminal device can also obtain the target text in different ways under different application scenarios. Taking the human-computer interaction scenario as an example, the terminal device can convert the voice signal input by the user into the corresponding text and use that text as the target text. In the scenario of converting text to speech, the terminal device can use text input by the user as the target text, or use text transmitted by a server as the target text; and so on. The way in which the terminal device obtains the target text is likewise not limited here.
Step 202: Perform word segmentation and part-of-speech tagging on the target text to obtain a segmentation-and-tagging sequence.
After obtaining the target text, the server can first perform word segmentation on the target text to obtain the segmentation sequence corresponding to the target text, and then perform part-of-speech tagging on each word in the segmentation sequence to obtain the segmentation-and-tagging sequence corresponding to the target text.
It should be noted that the related art already provides mature word segmentation and part-of-speech tagging methods. A word segmentation method of the related art can be used directly to segment the target text, and a part-of-speech tagging method of the related art can be used to tag the resulting segmentation sequence; the specific word segmentation and part-of-speech tagging methods used are not limited here.
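As a toy illustration of step 202 (not the patent's actual implementation, which defers to existing toolkits), the sketch below performs forward maximum-matching segmentation against a hypothetical mini-lexicon and tags each word by lexicon lookup:

```python
# Toy segmenter + POS tagger; the lexicon entries are hypothetical.
LEXICON = {  # word -> part-of-speech tag
    "向您": "p", "致以": "v", "诚挚": "a", "问候": "n",
    "和": "c", "美好": "a", "祝愿": "n", "。": "w",
}
MAX_WORD_LEN = max(len(w) for w in LEXICON)

def segment_and_tag(text):
    """Return [(word, pos), ...] via forward maximum matching."""
    result, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if cand in LEXICON or length == 1:
                result.append((cand, LEXICON.get(cand, "x")))
                i += length
                break
    return result

pairs = segment_and_tag("向您致以诚挚问候和美好祝愿。")
print(pairs)
```

A production system would replace both the lexicon and the matching strategy with a mature segmentation/tagging toolkit, as the description notes.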
Step 203: Perform word-level feature extraction according to the segmentation-and-tagging sequence to obtain a word-level feature sequence, in which the word-level features of each word include at least a word vector obtained through semantic feature extraction.
After the server obtains the segmentation-and-tagging sequence corresponding to the target text, it can perform word-level feature extraction on each word in the sequence, and then combine the word-level features of the words in order to obtain the word-level feature sequence corresponding to the target text. Here, the word-level features of each word include at least the word vector obtained through semantic feature extraction.
It should be noted that, in practical applications, in order to further improve the prediction effect for prosody hierarchy structures, the word-level features of each word may also include, in addition to the word vector obtained through semantic feature extraction, at least one of the position vector, part-of-speech vector, word-length vector, and post-word punctuation vector corresponding to the word. Enriching the word-level features of each word in this way ensures that the subsequent prediction of the prosody hierarchy structure based on these rich word-level features can be more accurate.
In a specific implementation, the server can perform semantic feature extraction on the segmentation-and-tagging sequence to obtain the word vector corresponding to each word; encode the position information of each word in the text to obtain the position vector corresponding to each word; then generate the word-level features of each word according to its word vector and at least one of its position vector, part-of-speech vector, word-length vector, and post-word punctuation-type vector; and finally combine the word-level features of the words in the segmentation-and-tagging sequence to obtain the word-level feature sequence.
When extracting semantic features, the server can perform semantic feature extraction on the segmentation-and-tagging sequence through a semantic feature extraction model to obtain the word vector corresponding to each word; that is, the server can input each word in the segmentation-and-tagging sequence into the semantic feature extraction model one by one, to obtain the word vector of each word output by the semantic feature extraction model.
It should be noted that, in order to make the word vectors obtained through semantic feature extraction rich in semantic features, the server can use a pre-trained Bidirectional Encoder Representations from Transformers (BERT) network structure or a Skip-Gram network structure as the semantic feature extraction model. Such a semantic feature extraction model (i.e., the BERT or Skip-Gram network structure) is pre-trained on a large corpus, so the extracted word vectors are generally rich in contextual information and have good semantic features; performing subsequent prosody hierarchy structure prediction based on these word vectors can help improve the prediction effect.
It should be understood that, in practical applications, the server can also use other network structures as the semantic feature extraction model; the model structure of the semantic feature extraction model is not specifically limited here.
Although the prosody hierarchy structure prediction model based on the self-attention mechanism can learn the dependency between words at any distance within the full sentence, the self-attention mechanism itself ignores the relative positional distance between words. In order to enable the model to subsequently make use of relative position information, the present application adopts the timing signal mechanism to encode the position information of each word in the target text into the position vector corresponding to the word. The position information can be encoded directly with formulas (1) and (2), without learning any parameters:
PE(t, 2i) = sin(t / 10000^(2i/d))    (1)
PE(t, 2i+1) = cos(t / 10000^(2i/d))    (2)
where t is the index of the time step, 2i and 2i+1 are the dimension indices of the encoding, and d is the dimension of the position encoding.
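The timing-signal encoding can be computed in a few lines. The sketch below assumes the standard sinusoidal form whose variables (t, 2i, 2i+1, d) match the description above; the exact form used in the patent may differ:

```python
import math

def position_vector(t, d):
    """Position vector of dimension d for time step t (no learned params)."""
    pe = [0.0] * d
    for i in range(0, d, 2):
        angle = t / (10000 ** (i / d))
        pe[i] = math.sin(angle)            # even dimensions: sine
        if i + 1 < d:
            pe[i + 1] = math.cos(angle)    # odd dimensions: cosine
    return pe

print(position_vector(0, 4))  # t = 0 gives [0.0, 1.0, 0.0, 1.0]
```

Because nearby time steps yield similar vectors and the wavelength grows with the dimension index, the model can recover relative distances between words from these vectors alone.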
The part-of-speech vector indicates the part of speech of a word, for example a vector indicating that the part of speech is a noun, or a vector indicating that the part of speech is a verb. The word-length vector indicates the number of characters contained in the word, for example a vector indicating that the word contains two characters, or a vector indicating that the word contains three characters. The post-word punctuation-type vector indicates whether there is punctuation after the word and, if so, the type of that punctuation: if there is punctuation after the word, the vector indicates the type of the punctuation following the word; if there is no punctuation after the word, the vector indicates that no punctuation follows the word. The part-of-speech vector, word-length vector, and post-word punctuation-type vector can each be represented as a one-hot vector.
Take as an example a target text meaning "We extend sincere greetings and good wishes." For the last word of the text, "wishes", the corresponding part-of-speech vector is a vector indicating that the part of speech is a noun, the word-length vector is a vector indicating that the word contains two characters (in the original Chinese text), and the post-word punctuation-type vector is a vector indicating that the punctuation after the word is a full stop.
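The three categorical features of the example word can be one-hot encoded as follows. The category inventories (POS tags, length buckets, punctuation types) are hypothetical placeholders, not the patent's actual inventories:

```python
# Toy one-hot encoding of the three categorical word features.
POS_TAGS = ["noun", "verb", "adjective", "conjunction", "other"]
LENGTHS = [1, 2, 3, 4]                     # word length in characters
PUNCT = ["none", "comma", "full_stop"]     # punctuation after the word

def one_hot(value, inventory):
    vec = [0] * len(inventory)
    vec[inventory.index(value)] = 1
    return vec

# Features of the last word, "wishes", in the example sentence:
pos_vec = one_hot("noun", POS_TAGS)        # part of speech: noun
len_vec = one_hot(2, LENGTHS)              # two characters
punct_vec = one_hot("full_stop", PUNCT)    # followed by a full stop
print(pos_vec, len_vec, punct_vec)
```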
It should be understood that, in practical applications, the server can generate, as needed, at least one of the above position vector, part-of-speech vector, word-length vector, and post-word punctuation-type vector for each word in the segmentation-and-tagging sequence, and then combine the generated vector(s) with the word vector of each word to generate the word-level features of the word.
In one possible implementation, in the case where the server generates the position vector, part-of-speech vector, word-length vector, and post-word punctuation-type vector for each word, the server may, for each word in the segmentation-and-tagging sequence, sum the word vector and the position vector corresponding to the word, and then concatenate the sum with the word's part-of-speech vector, word-length vector, and post-word punctuation-type vector to obtain the word-level feature corresponding to the word.
For example, suppose the server performs semantic feature extraction on the i-th word in the segmentation-and-tagging sequence using the semantic feature extraction model and obtains its corresponding word vector e_i. The server sums e_i with the position vector obtained by encoding the position information of the i-th word, and then concatenates the sum with the text-feature set r_i, where r_i is formed by concatenating the part-of-speech vector, word-length vector, and post-word punctuation-type vector of the i-th word.
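By way of illustration only, the assembly of a single word-level feature described above may be sketched as follows; the dimensions and example one-hot values here are assumptions for demonstration, not values taken from the present application:

```python
import numpy as np

def word_level_feature(e_i, p_i, pos_vec, len_vec, punct_vec):
    """Sum the word vector e_i with the position vector p_i, then
    concatenate the sum with the text-feature set r_i, where r_i is the
    concatenation of the one-hot part-of-speech, word-length, and
    post-word punctuation-type vectors."""
    r_i = np.concatenate([pos_vec, len_vec, punct_vec])
    return np.concatenate([e_i + p_i, r_i])

# Toy example: a 4-dim word/position vector plus three small one-hot features.
feat = word_level_feature(
    e_i=np.array([0.1, 0.2, 0.3, 0.4]),
    p_i=np.array([0.0, 0.1, 0.0, 0.1]),
    pos_vec=np.array([0, 1, 0]),   # e.g. part of speech is a noun
    len_vec=np.array([0, 1]),      # e.g. the word contains two characters
    punct_vec=np.array([1, 0]),    # e.g. a full stop follows the word
)
```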
In this way, factors closely related to the prosody hierarchy structure type are taken into account during prediction, enriching the semantic features referenced in prosody hierarchy structure prediction and thereby helping ensure the accuracy of that prediction.
After the word-level feature corresponding to each word in the segmentation-and-tagging sequence is obtained through the above processing, the word-level features are combined according to the order of the words in the sequence, yielding the word-level feature sequence corresponding to the target text.
Step 204: obtain the prosody hierarchy structure sequence corresponding to the word-level feature sequence by means of a prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a deep neural network model based on a self-attention mechanism.
After generating the word-level feature sequence corresponding to the target text, the server further processes the sequence with the prosody hierarchy structure prediction model to obtain the prosody hierarchy structure sequence corresponding to the target text; the prediction model is a deep neural network model based on a self-attention mechanism.
In one possible implementation, the network structure of the above prosody hierarchy structure prediction model includes a cascaded fully connected layer, N feature-processing layers (N being a positive integer), and a normalization layer, where each feature-processing layer specifically includes a nonlinear sublayer and a self-attention sublayer.
Referring to Fig. 3, Fig. 3 is a schematic architecture diagram of an illustrative prosody hierarchy structure prediction model provided by the embodiments of the present application. The word-level feature sequence input to the model may be denoted W = (w_1, w_2, …, w_i, …, w_n), where w_i, the word-level feature of the i-th word in the sequence, is generated as follows: a pre-trained semantic feature extraction model performs semantic feature extraction on the i-th word to obtain a word vector; the word vector is summed with the position vector; and the sum is concatenated with the vector formed by concatenating the part-of-speech vector, word-length vector, and post-word punctuation-type vector, yielding the word-level feature of the i-th word.
The fully connected layer at the front end of the prosody hierarchy structure prediction model mixes the features of the input word-level feature sequence to obtain a higher-level feature representation; N identical feature-processing layers are then stacked to form a deep network, each feature-processing layer consisting of a nonlinear sublayer and a self-attention sublayer, with a residual connection structure between the input and output of each sublayer and layer normalization applied to each sublayer's output; the final layer, a normalization layer (i.e., a softmax layer), outputs the probability distribution over prosody structure types for each word in the target text.
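The stacked structure described above may be sketched as follows; the toy sublayers, dimensions, and random parameters are illustrative assumptions, not trained values:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def forward(W, fc, blocks, out_proj):
    """W: (n_words, d_in) word-level feature sequence.
    fc is the front-end fully connected mixing layer; each block is a
    (self-attention sublayer, nonlinear sublayer) pair wrapped in residual
    connections and layer normalization; the final softmax layer outputs a
    probability distribution over prosody structure types per word."""
    h = W @ fc
    for self_attn, nonlinear in blocks:
        h = layer_norm(h + self_attn(h))   # residual connection + layer norm
        h = layer_norm(h + nonlinear(h))
    return softmax(h @ out_proj)           # (n_words, n_types)

rng = np.random.default_rng(0)
d = 8
identity_block = (lambda x: x * 0.0, lambda x: x * 0.0)  # placeholder sublayers
probs = forward(rng.normal(size=(5, 6)), rng.normal(size=(6, d)),
                [identity_block] * 2, rng.normal(size=(d, 4)))
```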
It should be noted that, in practical applications, the value of N may be set according to actual needs; no specific limitation is imposed on the value of N here.
Each layer of the network structure of the prosody hierarchy structure prediction model shown in Fig. 3 will be described in detail when the training method of the model is introduced below; refer to the related content in the training method of the prosody hierarchy structure prediction model hereinafter, which is not repeated here.
It should be understood that, in practical applications, other network structures may be adopted for the prosody hierarchy structure prediction model according to actual needs; the model shown in Fig. 3 is merely illustrative, and the present application does not impose any limitation on the specific structure of the prosody hierarchy structure prediction model.
In one possible implementation, the server inputs the word-level feature sequence into the prosody hierarchy structure prediction model, the model being a four-class model used to predict, for each word in the text, the probability of belonging to a non-prosodic-structure boundary (not a boundary, NB), a prosodic word boundary (prosodic word, PW), a prosodic phrase boundary (prosodic phrase, PPH), or an intonational phrase boundary (intonational phrase, IPH); the server thereby obtains the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the sequence including each word together with the prosody hierarchy structure type identifier of highest probability for that word.
Specifically, after the server inputs the word-level feature sequence into the prosody hierarchy structure prediction model, the model correspondingly predicts the probability of each word in the target text belonging to NB, PW, PPH, and IPH. The prosody hierarchy structure type identifier of highest probability is then determined for each word, this identifier characterizing the prosody hierarchy structure corresponding to the word; arranging the identifiers of the words in order yields the prosody hierarchy structure sequence corresponding to the input word-level feature sequence.
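The selection of the highest-probability type identifier for each word may be sketched as a simple argmax over the per-word probabilities; the ordering of the four class labels below is an assumption for illustration:

```python
TYPES = ["NB", "PW", "PPH", "IPH"]  # assumed ordering of the four classes

def decode(prob_rows):
    """For each word, pick the prosody hierarchy structure type of highest
    probability and arrange the identifiers in word order."""
    return [TYPES[max(range(len(row)), key=row.__getitem__)] for row in prob_rows]

seq = decode([
    [0.1, 0.2, 0.6, 0.1],   # word 1 -> PPH
    [0.7, 0.1, 0.1, 0.1],   # word 2 -> NB
    [0.1, 0.1, 0.1, 0.7],   # word 3 -> IPH
])
```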
In another possible implementation, the server inputs the word-level feature sequence into the prosody hierarchy structure prediction model, the model being a three-class model used to predict, for each word in the text, the probability of belonging to a non-prosodic-structure boundary (NB), a prosodic word boundary (PW), or a prosodic phrase boundary (PPH); the server thereby obtains the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the sequence including each word together with the prosody hierarchy structure type identifier of highest probability for that word.
Specifically, after the server inputs the word-level feature sequence into the prosody hierarchy structure prediction model, the model correspondingly predicts the probability of each word in the target text belonging to NB, PW, and PPH. The prosody hierarchy structure type identifier of highest probability is then determined for each word, this identifier characterizing the prosody hierarchy structure corresponding to the word; arranging the identifiers in order yields the prosody hierarchy structure sequence corresponding to the input word-level feature sequence.
It should be noted that, when the prosody hierarchy structure prediction model is a three-class model, in addition to predicting the probability of a word in the target text belonging to NB, PW, or PPH, it may instead predict the probability of a word belonging to PW, IPH, or PPH, or indeed to any other three prosody hierarchy structures; no limitation is imposed here on which three prosody hierarchy structures a three-class prediction model may predict.
In yet another possible implementation, the server may input the word-level feature sequence into the prosody hierarchy structure prediction model, the model being a two-class model used to predict, for each word in the target text, the probability of belonging to a prosodic phrase boundary (PPH) or a non-prosodic-phrase boundary; the server thereby obtains the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the sequence including each word together with the identifier of the prosody hierarchy structure type of highest probability for that word.
Specifically, after the server inputs the word-level feature sequence into the prosody hierarchy structure prediction model, the model correspondingly predicts the probability of each word in the target text belonging to PPH and non-PPH. When the type identifier of highest probability for a word is PPH, the word is characterized as belonging to PPH; when it is non-PPH, the word is characterized as belonging to non-PPH. Arranging the identifiers of the words in order yields the prosody hierarchy structure sequence corresponding to the input word-level feature sequence.
It should be noted that, when the prosody hierarchy structure prediction model is a two-class model, in addition to predicting the probability of a word in the target text belonging to PPH or non-PPH, it may instead predict the probability of a word belonging to IPH or non-IPH, or to PW or non-PW; no limitation is imposed here on which two prosody hierarchy structures a two-class prediction model may predict.
After the processing of steps 201 to 204, the server obtains the prosody hierarchy structure sequence corresponding to the target text; the server may then perform speech synthesis processing according to this prosody hierarchy structure sequence, a target timbre, a target speaking rate, a target volume, and a target sampling rate, to obtain the target speech corresponding to the target text.
It should be understood that the above target timbre, target speaking rate, target volume, and target sampling rate may be set personally by the user, or may be default parameters of the speech synthesis system; no limitation is imposed here on the manner of setting or the specific values of the target timbre, target speaking rate, target volume, and target sampling rate.
The above AI-based prosody hierarchy structure prediction method uses a deep neural network model based on a self-attention mechanism to predict the prosody hierarchy structure of each segmented word in the target text. Through its self-attention sublayers, the model can better capture context dependencies across the full sentence; compared with the CRF and RNN models of the related art, this deep neural network model has better sequence-modeling ability and, correspondingly, achieves a better prosody hierarchy structure prediction effect, which in turn helps improve the quality of subsequent speech synthesis.
It should be understood that, in practical applications, whether the AI-based prosody hierarchy structure prediction method provided by the embodiments of the present application can accurately predict the prosody hierarchy structure corresponding to a target text depends primarily on the model performance of the prosody hierarchy structure prediction model, and that performance is closely related to the model's training process. The training method of the prosody hierarchy structure prediction model provided by the present application is introduced below through embodiments.
Referring to Fig. 4, Fig. 4 is a flow diagram of the training method of the prosody hierarchy structure prediction model provided by the embodiments of the present application. For ease of description, the following embodiments take a server as the executing subject when introducing the training method. As shown in Fig. 4, the training method of the prosody hierarchy structure prediction model includes the following steps:
Step 401: obtain a training sample set, the training sample set including training samples and the prosody hierarchy structure label corresponding to each training sample.
Before the prosody hierarchy structure prediction model is trained, it is usually necessary to obtain a large number of training samples, each with a corresponding prosody hierarchy structure label, so as to form the training sample set used for training the prosody hierarchy structure prediction model.
It should be noted that the prosody hierarchy structure labels annotated for the training samples are closely related to the type of prosody hierarchy structure prediction model to be trained. When the model is a four-class model predicting the probability of each word in a text belonging to PW, PPH, IPH, or NB, the labels annotated by the server for the training samples should include the four prosody hierarchy structure type identifiers PW, PPH, IPH, and NB; when the model is a three-class model predicting the probability of each word belonging to NB, PW, or PPH, the labels should include the three identifiers NB, PW, and PPH; when the model is a two-class model predicting the probability of each word belonging to PPH or non-PPH, the labels should include the two identifiers PPH and non-PPH; and so on.
Optionally, in order to let the prosody hierarchy structure prediction model learn a certain degree of uncertainty, which helps improve the model's generalization ability and its prediction effect, the server may apply label smoothing to the prosody hierarchy structure label corresponding to each training sample in the training sample set.
Taking labels covering the four types PW, PPH, IPH, and NB as an example, suppose a given word belongs to IPH; the prosody hierarchy structure label corresponding to that word is then expressed as the following one-hot vector:
TAG_IPH = (0, 0, 0, 1)
Applying label smoothing to this prosody hierarchy structure label adds noise, introducing a certain degree of uncertainty. Assuming the smoothing value is set to 0.1, the label vector after label smoothing is expressed as follows:
SMOOTH_IPH = (0.03, 0.03, 0.03, 0.9)
In this way, before the prosody hierarchy structure prediction model is trained, the prosody hierarchy structure labels corresponding to all training samples may be smoothed, allowing the model to learn a certain degree of uncertainty during training.
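Under the assumption that the true class keeps 1 − ε and the remaining probability mass ε is spread evenly over the other classes (which reproduces the numbers in the example above; the rounding is only to mirror that example), the smoothing may be sketched as:

```python
def smooth_label(one_hot, eps=0.1):
    """Replace the 1 in a one-hot label with 1 - eps and each 0 with
    eps / (K - 1), where K is the number of classes."""
    k = len(one_hot)
    return [round(1 - eps if v == 1 else eps / (k - 1), 2) for v in one_hot]

smoothed = smooth_label([0, 0, 0, 1])  # label for IPH in a four-class model
```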
Step 402: perform parameter training on a deep neural network model based on a self-attention mechanism using the training sample set, and use the trained self-attention-based deep neural network model as the prosody hierarchy structure prediction model.
After obtaining the training sample set, the server may use it to perform parameter training on a pre-constructed deep neural network model based on a self-attention mechanism, until training yields a self-attention-based deep neural network model that meets the training termination condition; this model is then used as the prosody hierarchy structure prediction model and may be put into practical application.
Taking the case where the pre-constructed self-attention-based deep neural network model has the model structure shown in Fig. 3 as an example, the training method of the prosody hierarchy structure prediction model is introduced below. First, the self-attention sublayer, the nonlinear sublayer, and the residual connection manner in this self-attention-based deep neural network model are each introduced:
The attention mechanism may be regarded as obtaining the representation of a query (query) from the query and a series of key (key)-value (value) pairs. The specific processing is as follows: compute the similarity between the query and each key to obtain a series of weights, and then take the weighted sum of the corresponding values to obtain the representation of the query. When computing similarity, common similarity calculation methods such as additive attention or dot-product attention may be adopted; below, scaled dot-product attention (Scaled dot-product attention), a form of dot-product attention, is taken as an example to introduce the similarity calculation process.
Referring to Fig. 5, Fig. 5 is a flow diagram of computing similarity using the scaled dot-product attention mechanism, where Q is the query sequence, K is the series of keys, and V is the values corresponding to the keys. As shown in Fig. 5, Q and K first undergo matrix multiplication, the result is scaled by the scaling factor and the scaled result is normalized, and the output is finally obtained through matrix multiplication with V. The specific calculation process can be expressed as formula (3):
Attention(Q, K, V) = softmax(QK^T / √d_k)V    (3)
where √d_k is the scaling factor and d_k is the dimension of the key vectors.
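Formula (3) may be sketched directly in code as follows; the dimensions used here are illustrative:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # similarities -> weights
    return weights @ V                         # weighted sum of values

rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(3, 4)),
                                   rng.normal(size=(5, 4)),
                                   rng.normal(size=(5, 2)))
```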
The self-attention mechanism, meanwhile, needs only a single sequence to compute the representation of that sequence. Multi-head attention applies h linear transformations to the queries, keys, and values, and then performs scaled dot-product attention in parallel; each scaled dot-product yields a d_v-dimensional representation, and the h d_v-dimensional values are concatenated into an h·d_v-dimensional vector to form one output. The specific calculation flow of the multi-head attention mechanism is shown in Fig. 6.
The calculation formulas of the multi-head attention mechanism are shown in formula (4) and formula (5):
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (4)
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O    (5)
where W_i^Q ∈ R^{d×d_k}, W_i^K ∈ R^{d×d_k}, and W_i^V ∈ R^{d×d_v} are the linear transformation matrices of the query, key, and value respectively, and W^O ∈ R^{hd_v×d} is the matrix of the final linear transformation applied after concatenating the scaled dot-product outputs. In the prosody hierarchy structure prediction model provided by the embodiments of the present application, the number of heads may be set to 8, and for each head the parameters may be set to d = 256 and d_k = d_v = d/h = 64; of course, in practical applications the above parameters may also be set according to actual needs and are not specifically limited here.
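Formulas (4) and (5) may be sketched as follows; the head count and dimensions here are scaled down from the h = 8, d = 256 configuration purely for illustration:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(Q, K, V, WQ, WK, WV, WO):
    """head_i = Attention(Q WQ[i], K WK[i], V WV[i]);
    MultiHead = Concat(head_1, ..., head_h) WO."""
    heads = [attention(Q @ WQ[i], K @ WK[i], V @ WV[i]) for i in range(len(WQ))]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
h, d, dk = 2, 8, 4                      # scaled-down stand-ins for h=8, d=256, dk=64
X = rng.normal(size=(5, d))             # self-attention: Q = K = V = X
WQ = rng.normal(size=(h, d, dk)); WK = rng.normal(size=(h, d, dk))
WV = rng.normal(size=(h, d, dk)); WO = rng.normal(size=(h * dk, d))
Y = multi_head(X, X, X, WQ, WK, WV, WO)
```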
The present application has conducted further exploratory research on the self-attention mechanism and applies it to prosody hierarchy structure prediction. Specifically, the self-attention mechanism is realized in the prosody hierarchy structure prediction model through the self-attention sublayer, which is mainly used to capture the context dependencies between words across the full sentence and form word-level feature representations rich in contextual information. Higher-level prosodic structure may depend on words far apart, and the self-attention mechanism is adopted chiefly to capture the dependencies between distant words, thereby helping improve the prediction effect of the prosody hierarchy structure. Suppose a sentence has T words; computing the feature representation of the last word requires the semantic features of all words in the sentence: their similarities are computed to obtain a weight for each word, and the feature representation of the word is then obtained as the weighted sum.
Compared with the CRF and RNN models of the related art, the self-attention mechanism can directly capture the dependency between the sentence-initial word and the sentence-final word, and is insensitive to the distance between two words; CRF and RNN models, by contrast, need T−1 computation steps to form the feature input of the last word before learning the dependency between the sentence-initial and sentence-final words, and moreover, after so many recurrent computations, the complete information of the sentence-initial word cannot be guaranteed to remain by the time the last word is reached. A deep neural network model based on a self-attention mechanism is therefore more conducive than CRF or RNN models to learning dependencies between distant words; and the prediction of intonational phrase boundaries in prosody hierarchy structure prediction often depends on a distant preceding intonational phrase boundary. The self-attention mechanism depends on the input of every word in the full sentence, and its manner of calculation is more conducive to learning the structural information of the sentence as a whole.
To adapt to different usage demands, the nonlinear sublayer may be a fully connected network sublayer or a recurrent neural network sublayer. Specifically, when faster training speed is pursued, a fully connected network sublayer may be adopted as the nonlinear sublayer; when higher prediction accuracy is pursued, a recurrent neural network sublayer may be adopted. The two types of nonlinear sublayer are introduced separately below:
When the nonlinear sublayer is a fully connected network sublayer, as shown in Fig. 7, the fully connected network sublayer may be used in combination with the self-attention network sublayer to apply a nonlinear transformation to the input; it mainly performs two linear transformations, with the intermediate layer using a rectified linear unit (Rectified Linear Unit, ReLU) activation function. The specific calculation process is shown in formula (6):
FFN(X) = ReLU(XW_1)W_2    (6)
where W_1 ∈ R^{d×d} and W_2 ∈ R^{d×d} are the parameters to be learned when training the fully connected network sublayer.
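Formula (6) may be sketched as follows, with an illustrative stand-in dimension in place of d = 256:

```python
import numpy as np

def ffn(X, W1, W2):
    """FFN(X) = ReLU(X W1) W2: two linear transformations with a
    rectified linear unit activation in between."""
    return np.maximum(X @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d = 4                                    # illustrative stand-in for d = 256
X = rng.normal(size=(3, d))
Y = ffn(X, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```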
When the nonlinear sublayer is a recurrent neural network sublayer: since the input of prosody hierarchy structure prediction is a word-level feature sequence and the output is the corresponding prosody hierarchy structure sequence, the task is in fact a mapping from sequence to sequence. Although RNNs are suited to sequence modeling, when the sequence is long, RNN training becomes difficult because of gradient explosion or gradient vanishing; gated RNNs are a relatively good method of alleviating this training problem, chiefly the long short-term memory (Long Short-Term Memory, LSTM) unit and its variant, the GRU unit. Compared with the LSTM unit, the GRU unit has more concise gates and fewer parameters, so the model converges faster. Since a unidirectional RNN carries contextual information in only one direction, a bidirectional RNN is needed to obtain bidirectional contextual information.
Although a gated RNN can learn dependencies across past time steps, it can only learn information in one direction, which restricts its performance; a bidirectional RNN enables the network to learn context dependencies in both directions. Therefore, a bidirectional RNN network structure can be applied in the prosody hierarchy structure prediction model of the present application, and accordingly the RNN nonlinear sublayer may have the following configurations:
1. a unidirectional GRU-RNN sublayer;
2. a bidirectional LSTM-RNN sublayer, i.e., bidirectional long short-term memory (Bidirectional Long Short-Term Memory, BLSTM);
3. a bidirectional GRU-RNN sublayer, i.e., bidirectional gated recurrent unit (Bidirectional Gated Recurrent Unit, BGRU).
Because residual connections exist between the inputs and outputs of the sublayers, the input dimension and output dimension of the data must be kept equal. Therefore, when the unidirectional GRU-RNN sublayer is used, the number of neurons may be set to 256; when a BLSTM or BGRU sublayer is used, the number of neurons in each direction is set to 128, and the bidirectional outputs are concatenated to form 256 dimensions.
As the number of layers increases, a deep neural network model may see its accuracy on the training set saturate or even decline; this is the degradation problem of neural network models. Residual connection is an effective method for training deep neural network models. In its specific implementation, residual connections exist between the sublayers inside each feature-processing layer, and an element-wise add operation is performed on each dimension at the junction; the specific operation process is shown in Fig. 8.
In each sublayer of the prosody hierarchy structure prediction model of the present application that uses residual connections, the calculation process can be expressed by formula (7):
Y = X + SubLayer(X)    (7)
where X and Y denote the input and the output of the sublayer, respectively.
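Formula (7), together with the layer normalization applied after the residual connection, may be sketched as follows; the toy sublayer is an illustrative assumption:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def residual_sublayer(X, sublayer):
    """Y = X + SubLayer(X), followed by layer normalization; the input and
    output dimensions of the sublayer must match for the addition."""
    return layer_norm(X + sublayer(X))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Y = residual_sublayer(X, lambda h: h * 0.5)   # toy sublayer for illustration
```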
After residual error connection, it can also further pass through layer standardized operation, to control distribution between layers;This Shen
Please in prosody hierarchy structure prediction model need to stack repeatedly identical characteristic processing layer, increase with number is stacked, mould
The increase of moldeed depth degree also brings along and is difficult to trained problem, connected by residual error, can aid in prosody hierarchy knot in the application
The training of structure prediction model is more favorable to attempt deeper network structure configuration.
It should be understood that the self-attention-based deep neural network model to be trained may have model structures other than that shown in Fig. 3; the present application imposes no limitation here on the structure of the self-attention-based deep neural network model to be trained.
When specifically judging whether a trained deep neural network model meets the training termination condition, a test sample may be used to verify a first model, the first model being the model obtained by performing a first round of training on the deep neural network model using the training samples of the training sample set. Specifically, the server inputs the test sample into the first model, which correspondingly processes the input test sample to obtain the prosody hierarchy structure corresponding to the test sample; then, according to the annotated prosody hierarchy structure corresponding to the test sample and the result output by the first model, the server determines the prediction accuracy of the first model. When the prediction accuracy is greater than a preset threshold, the working performance of the first model may be considered to meet the demand; the first model may then be determined to be a deep neural network model meeting the training termination condition, and this deep neural network model may be used as the prosody hierarchy structure prediction model.
Moreover, when judging whether the above deep neural network model meets the training termination condition, the multiple models obtained over multiple rounds of training may also be used to determine whether to continue training the deep neural network model, so as to obtain the prosody hierarchy structure prediction model of optimal working performance. Specifically, test samples may be used to verify the multiple deep neural network models obtained over the rounds of training. If it is determined that the gaps between the prediction accuracies of the models obtained in the respective rounds are small, the model performance of the deep neural network model is considered to have no further room for improvement, and the deep neural network model with the highest prediction accuracy may be selected as the prosody hierarchy structure prediction model meeting the training termination condition. If it is determined that the gaps between the prediction accuracies of the models obtained in the respective rounds are large, the performance of the deep neural network model is considered to still have room for improvement, and training of the deep neural network model continues until the most stable, best-performing deep neural network model is obtained and used as the prosody hierarchy structure prediction model.
The above training method of the prosody hierarchy structure prediction model uses the obtained training sample set to perform parameter training on a pre-constructed deep neural network model based on a self-attention mechanism, and then puts the trained self-attention-based deep neural network model into practical application as the prosody hierarchy structure prediction model. Through its self-attention sublayers, this self-attention-based deep neural network model can better capture context dependencies across the full sentence; compared with the CRF and RNN models of the related art, it has better sequence-modeling ability and, correspondingly, can obtain a better prosody hierarchy structure prediction effect, which in turn helps improve the quality of subsequent speech synthesis.
To facilitate further understanding of the AI-based prosody hierarchy structure prediction method provided by the embodiments of the present application, an overall exemplary introduction to the method is given below, taking as an example the application scenario of synthesizing target speech from a target text transmitted by a user, with the prosody hierarchy structure prediction model being a four-class model. Referring to Fig. 9, Fig. 9 is a flow diagram of the AI-based prosody hierarchy structure prediction method.
When the user needs to synthesize the target text "showing sincere greetings and good wishes" into its corresponding target speech, the user may input "showing sincere greetings and good wishes" into a terminal device, so as to transmit the target text to the server through the terminal device. After the server obtains the target text, it first performs word segmentation and part-of-speech tagging on the target text to obtain the segmentation-and-tagging sequence corresponding to the target text.
Then, the server performs word-level feature extraction on the segmentation and tagging sequence corresponding to the target text to obtain the corresponding word-level feature sequence. Specifically, the server may use a BERT network structure to perform semantic feature extraction on the segmentation and tagging sequence, obtaining the word vector corresponding to each word in the sequence; it also encodes the position information of each word in the target text, obtaining the position vector corresponding to each word. Then, for each word in the segmentation and tagging sequence, the server sums the word's word vector and position vector, and concatenates this sum with the word's part-of-speech vector, word-length vector, and post-word punctuation-type vector, obtaining the word-level feature corresponding to that word. Finally, the word-level features of the words are combined in the order in which the words appear in the segmentation and tagging sequence, yielding the word-level feature sequence.
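The feature-construction step above (summing the word vector with the position vector, then concatenating the part-of-speech, word-length, and post-word punctuation-type vectors) can be sketched as follows. The dimensions and all vector values are toy assumptions for illustration; a real system would obtain the word vectors from BERT and use learned embeddings for the other features.

```python
# Sketch of word-level feature construction, assuming toy vectors:
# sum the semantic word vector and the position vector, then concatenate
# the part-of-speech, word-length, and post-word punctuation-type vectors.

def build_word_feature(word_vec, pos_enc, pos_tag_vec, length_vec, punct_vec):
    # Element-wise sum of the word vector and the position vector.
    summed = [w + p for w, p in zip(word_vec, pos_enc)]
    # Concatenate the sum with the remaining per-word vectors.
    return summed + pos_tag_vec + length_vec + punct_vec

def build_feature_sequence(words):
    # Each entry of `words` holds the five per-word vectors; features are
    # combined in the order the words appear in the tagged sequence.
    return [build_word_feature(*w) for w in words]

# Two-word toy sequence with 4-dim word/position vectors.
seq = [
    ([0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1], [1, 0], [0, 1], [1, 0, 0]),
    ([0.5, 0.5, 0.5, 0.5], [0.1, 0.0, 0.1, 0.0], [0, 1], [1, 0], [0, 1, 0]),
]
features = build_feature_sequence(seq)
```

Each resulting feature keeps the summed semantic-plus-position part at a fixed offset, so downstream layers can treat the concatenated tail as categorical side information.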
Next, the server inputs the generated word-level feature sequence into the prosody hierarchy structure prediction model, which is a deep neural network model based on the self-attention mechanism. The model processes the sequence and generates, for each word in the target text, the probabilities that the word belongs to NB, PW, PPH, and IPH. For each word in the target text, the prosody hierarchy structure identifier with the highest probability is then determined as that word's prosody hierarchy structure. For example, if for "showing" the probability of belonging to PPH is highest, it can be determined that the prosody hierarchy structure corresponding to "showing" is PPH; likewise, if for "sincere" the probability of belonging to NB is highest, it can be determined that the prosody hierarchy structure corresponding to "sincere" is NB; and so on. Finally, the determined prosody hierarchy structure identifiers of the words in the target text are arranged in order, so that the prosody hierarchy structure sequence corresponding to the target text is obtained as "show<PPH>sincere<NB><PW>greet<PW>wish of<IPH>and<PW>fine<NB><IPH>".
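The per-word decision described above — take, for each word, the label with the highest probability among NB, PW, PPH, and IPH — reduces to a softmax over the model's four output scores followed by an argmax. The logit values below are hypothetical, not outputs of the embodiment's model.

```python
import math

# Four prosody classes of the embodiment: non-boundary, prosodic word,
# prosodic phrase, and intonation phrase boundaries.
LABELS = ["NB", "PW", "PPH", "IPH"]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def decode(logits_per_word):
    # For each word, pick the prosody label with the highest probability.
    out = []
    for logits in logits_per_word:
        probs = softmax(logits)
        out.append(LABELS[probs.index(max(probs))])
    return out

# Hypothetical model outputs for a three-word sentence.
logits = [[0.2, 0.1, 2.3, 0.4],   # PPH has the largest score
          [1.9, 0.3, 0.2, 0.1],   # NB has the largest score
          [0.1, 0.2, 0.3, 2.5]]   # IPH has the largest score
labels = decode(logits)  # -> ["PPH", "NB", "IPH"]
```

Since softmax is monotonic, the argmax could equally be taken over the raw logits; the probabilities are still useful if a confidence threshold is wanted.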
The server can then generate the target voice corresponding to the target text based on the determined prosody hierarchy structure sequence, in combination with the target voice type, target speech rate, target volume, and target sampling rate set by the user, and transmit the target voice to the terminal device so that the terminal device plays it.
For the above-described AI-based prosody hierarchy structure prediction method, the present application also provides a corresponding AI-based prosody hierarchy structure prediction device, so that the above AI-based prosody hierarchy structure prediction method can be applied and realized in practice.
Referring to Fig. 10, Fig. 10 is a structural schematic diagram of an AI-based prosody hierarchy structure prediction device 1000 corresponding to the AI-based prosody hierarchy structure prediction method shown in Fig. 2 above. The device comprises:
an obtaining module 1001, configured to obtain a target text;
a segmentation and part-of-speech tagging module 1002, configured to perform word segmentation and part-of-speech tagging on the target text to obtain a segmentation and tagging sequence;
a word-level feature extraction module 1003, configured to perform word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence, wherein the word-level feature of each word in the word-level feature sequence at least comprises a word vector obtained through semantic feature extraction;
a prosody hierarchy structure prediction module 1004, configured to obtain, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, referring to Fig. 11, Fig. 11 is a structural schematic diagram of another prosody hierarchy structure prediction device provided by the embodiments of the present application, wherein the word-level feature extraction module 1003 comprises:
a semantic feature extraction submodule 1101, configured to perform semantic feature extraction on the segmentation and tagging sequence to obtain the word vector corresponding to each word;
a position vector encoding submodule 1102, configured to encode the position information of each word in the target text to obtain the position vector corresponding to each word;
a word-level feature generation submodule 1103, configured to generate the word-level feature corresponding to each word according to the position vector corresponding to each word in the segmentation and tagging sequence, at least one of the part-of-speech vector, word-length vector, and post-word punctuation-type vector, and the word vector corresponding to each word;
a combination submodule 1104, configured to combine the word-level features corresponding to the words in the segmentation and tagging sequence to obtain the word-level feature sequence.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 11, the semantic feature extraction submodule 1101 is specifically configured to:
perform semantic feature extraction on the segmentation and tagging sequence through a semantic feature extraction model to obtain the word vector corresponding to each word, wherein the semantic feature extraction model uses a BERT network structure or a Skip-Gram network structure.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 11, the word-level feature generation submodule 1103 is specifically configured to:
for each word in the segmentation and tagging sequence, sum the word vector and the position vector corresponding to the word, and concatenate the sum with the part-of-speech vector, word-length vector, and post-word punctuation-type vector corresponding to the word, obtaining the word-level feature corresponding to the word.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, referring to Fig. 12, Fig. 12 is a structural schematic diagram of another prosody hierarchy structure prediction device provided by the embodiments of the present application. The device further comprises:
a sample acquisition module 1201, configured to obtain a training sample set, the training sample set including training samples and a prosody hierarchy structure label corresponding to each training sample;
a training module 1202, configured to perform parameter training on a deep neural network model based on the self-attention mechanism using the training sample set, and to use the trained deep neural network model based on the self-attention mechanism as the prosody hierarchy structure prediction model.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 12, referring to Fig. 13, Fig. 13 is a structural schematic diagram of another prosody hierarchy structure prediction device provided by the embodiments of the present application. The device further comprises:
a label smoothing module 1301, configured to perform label smoothing on the prosody hierarchy structure label corresponding to each training sample in the training sample set;
the training module 1202 is then specifically configured to:
perform parameter training on the deep neural network model based on the self-attention mechanism using the training sample set after label smoothing.
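Label smoothing, as applied by a module such as 1301, can be sketched as follows: each one-hot prosody label is mixed with a uniform distribution over the K classes. The smoothing factor 0.1 is an assumed value, not one specified by the embodiment.

```python
def smooth_label(one_hot, eps=0.1):
    # Mix the one-hot target with a uniform distribution over K classes:
    # y_smooth = (1 - eps) * y + eps / K
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]

# One-hot label for a 4-class (NB/PW/PPH/IPH) model; after smoothing,
# the true class keeps most of the mass and the rest is spread uniformly.
smoothed = smooth_label([0, 0, 1, 0], eps=0.1)
```

Training against such soft targets discourages the network from producing overconfident probabilities, which tends to regularize boundary classification.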
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, the network structure of the prosody hierarchy structure prediction model comprises a cascaded fully connected layer, N feature processing layers, and a normalization layer, N being a positive integer; each feature processing layer comprises a nonlinear sublayer and a self-attention sublayer.
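A minimal sketch of one such feature processing layer — a self-attention sublayer followed by a nonlinear sublayer — is given below, assuming single-head scaled dot-product attention with identity query/key/value projections. A real layer would add learned projection matrices, and the full model would stack N such layers between the cascaded fully connected layer and the normalization layer.

```python
import math

def self_attention(seq):
    # Scaled dot-product self-attention with identity Q/K/V projections:
    # each output vector is a probability-weighted mix over all positions,
    # so every word can attend to context across the full sentence.
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]  # softmax over positions
        out.append([sum(w * v[i] for w, v in zip(weights, seq))
                    for i in range(d)])
    return out

def nonlinear_sublayer(seq):
    # Position-wise ReLU nonlinearity.
    return [[max(0.0, x) for x in v] for v in seq]

def feature_processing_layer(seq):
    return nonlinear_sublayer(self_attention(seq))

# Toy 3-word, 2-dim feature sequence.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = feature_processing_layer(features)
```

Because the attention weights are computed between every pair of positions, the receptive field covers the whole sentence in a single sublayer, in contrast with the step-by-step propagation of an RNN.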
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, the prosody hierarchy structure prediction module 1004 is specifically configured to:
input the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a four-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a prosodic word boundary, a prosodic phrase boundary, an intonation phrase boundary, and a non-prosodic-structure boundary;
obtain the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence including, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, the prosody hierarchy structure prediction module 1004 is specifically configured to:
input the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a three-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a non-prosodic-structure boundary, a prosodic word boundary, and a prosodic phrase boundary;
obtain the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence including, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, the prosody hierarchy structure prediction module 1004 is specifically configured to:
input the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a two-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a prosodic phrase boundary and a non-prosodic-phrase boundary;
obtain the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence including, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, referring to Fig. 14, Fig. 14 is a structural schematic diagram of another prosody hierarchy structure prediction device provided by the embodiments of the present application. The device further comprises:
a speech synthesis module 1401, configured to perform speech synthesis according to the prosody hierarchy structure sequence to obtain a target voice.
The above AI-based prosody hierarchy structure prediction device uses a deep neural network model based on the self-attention mechanism to predict the prosody hierarchy structure of each word in the target text. Through its self-attention sublayers, this deep neural network model can better capture context dependencies across the full sentence. Compared with the CRF and RNN models of the related art, it has stronger sequence-modeling ability and, correspondingly, can achieve better prosody hierarchy structure prediction, which in turn helps improve the quality of subsequent speech synthesis.
The embodiments of the present application also provide a server and a terminal device for predicting prosody hierarchy structures. The server and terminal device for predicting prosody hierarchy structures provided by the embodiments of the present application are introduced below from the perspective of hardware entities.
Referring to Fig. 15, Fig. 15 is a schematic diagram of a server architecture provided by the embodiments of the present application. The server 1500 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1522 (e.g., one or more processors), memory 1532, and one or more storage media 1530 (e.g., one or more mass storage devices) storing application programs 1542 or data 1544. The memory 1532 and the storage medium 1530 may provide transient or persistent storage. The programs stored on the storage medium 1530 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 1522 may be configured to communicate with the storage medium 1530 and execute, on the server 1500, the series of instruction operations stored in the storage medium 1530.
The server 1500 may also include one or more power supplies 1526, one or more wired or wireless network interfaces 1550, one or more input/output interfaces 1558, and/or one or more operating systems 1541, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
The steps performed by the server in the above embodiments may be based on the server architecture shown in Fig. 15, where the CPU 1522 is configured to execute the following steps:
obtaining a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a segmentation and tagging sequence;
performing word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence, wherein the word-level feature of each word in the word-level feature sequence at least comprises a word vector obtained through semantic feature extraction;
obtaining, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
Optionally, the CPU 1522 may also execute the method steps of any specific implementation of the AI-based prosody hierarchy structure prediction method in the embodiments of the present application.
Referring to Fig. 16, Fig. 16 is a structural schematic diagram of a terminal device provided by the embodiments of the present application. For convenience of description, only the parts relevant to the embodiments of the present application are shown; for specific technical details not disclosed, please refer to the method part of the embodiments of the present application. The terminal may be any terminal device, including a computer, a tablet computer, or a personal digital assistant (PDA). Taking a mobile phone as the terminal as an example:
Fig. 16 shows a block diagram of part of the structure of a mobile phone related to the terminal provided by the embodiments of the present application. Referring to Fig. 16, the mobile phone includes components such as a radio frequency (RF) circuit 1610, a memory 1620, an input unit 1630, a display unit 1640, a sensor 1650, an audio circuit 1660, a wireless fidelity (WiFi) module 1670, a processor 1680, and a power supply 1690. Those skilled in the art will appreciate that the phone structure shown in Fig. 16 does not constitute a limitation on the mobile phone, which may include more or fewer components than illustrated, combine certain components, or have a different component arrangement.
The memory 1620 may be used to store software programs and modules, and the processor 1680 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1620. The memory 1620 may mainly include a program storage area and a data storage area, wherein the program storage area may store the operating system and the application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the mobile phone (such as audio data or a phone book). In addition, the memory 1620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other solid-state storage devices.
The processor 1680 is the control center of the mobile phone: it connects the various parts of the entire phone through various interfaces and lines, and performs the various functions of the phone and processes data by running or executing the software programs and/or modules stored in the memory 1620 and calling the data stored in the memory 1620, thereby monitoring the phone as a whole. Optionally, the processor 1680 may include one or more processing units; preferably, the processor 1680 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 1680.
In the embodiments of the present application, the processor 1680 included in the terminal also has the following functions:
obtaining a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a segmentation and tagging sequence;
performing word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence, wherein the word-level feature of each word in the word-level feature sequence at least comprises a word vector obtained through semantic feature extraction;
obtaining, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
Optionally, the processor 1680 is also configured to execute the steps of any implementation of the AI-based prosody hierarchy structure prediction method provided by the embodiments of the present application.
The embodiments of the present application also provide a computer-readable storage medium for storing a computer program, the computer program being used to execute any implementation of the AI-based prosody hierarchy structure prediction method described in the foregoing embodiments.
The embodiments of the present application also provide a computer program product including instructions which, when run on a computer, cause the computer to execute any implementation of the AI-based prosody hierarchy structure prediction method described in the foregoing embodiments.
It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division of units is only a logical functional division, and there may be other division manners in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Moreover, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be realized in the form of hardware, or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part contributing to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (15)
1. A text prosody hierarchy structure prediction method, characterized by comprising:
obtaining a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a segmentation and tagging sequence;
performing word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence, wherein the word-level feature of each word in the word-level feature sequence at least comprises a word vector obtained through semantic feature extraction;
obtaining, through a prosody hierarchy structure prediction model, a prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on a self-attention mechanism.
2. The text prosody hierarchy structure prediction method according to claim 1, characterized in that the performing word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence comprises:
performing semantic feature extraction on the segmentation and tagging sequence to obtain a word vector corresponding to each word;
encoding position information of each word in the target text to obtain a position vector corresponding to each word;
generating a word-level feature corresponding to each word according to the position vector corresponding to each word in the segmentation and tagging sequence, at least one of a part-of-speech vector, a word-length vector, and a post-word punctuation-type vector, and the word vector corresponding to each word;
combining the word-level features corresponding to the words in the segmentation and tagging sequence to obtain the word-level feature sequence.
3. The text prosody hierarchy structure prediction method according to claim 2, characterized in that the performing semantic feature extraction on the segmentation and tagging sequence to obtain a word vector corresponding to each word comprises:
performing semantic feature extraction on the segmentation and tagging sequence through a semantic feature extraction model to obtain the word vector corresponding to each word, wherein the semantic feature extraction model uses a BERT network structure or a Skip-Gram network structure.
4. The text prosody hierarchy structure prediction method according to claim 2, characterized in that the generating a word-level feature corresponding to each word according to the position vector corresponding to each word in the segmentation and tagging sequence, at least one of the part-of-speech vector, word-length vector, and post-word punctuation-type vector, and the word vector corresponding to each word comprises:
for each word in the segmentation and tagging sequence, summing the word vector and the position vector corresponding to the word, and concatenating the sum with the part-of-speech vector, word-length vector, and post-word punctuation-type vector corresponding to the word to obtain the word-level feature corresponding to the word.
5. The text prosody hierarchy structure prediction method according to claim 1, characterized in that the method further comprises:
obtaining a training sample set, the training sample set comprising training samples and a prosody hierarchy structure label corresponding to each training sample;
performing parameter training on a deep neural network model based on the self-attention mechanism using the training sample set, and using the trained deep neural network model based on the self-attention mechanism as the prosody hierarchy structure prediction model.
6. The text prosody hierarchy structure prediction method according to claim 5, characterized in that the method further comprises:
performing label smoothing on the prosody hierarchy structure label corresponding to each training sample in the training sample set;
the performing parameter training on the deep neural network model based on the self-attention mechanism using the training sample set then comprises: performing parameter training on the deep neural network model based on the self-attention mechanism using the training sample set after label smoothing.
7. The text prosody hierarchy structure prediction method according to any one of claims 1 to 6, characterized in that the network structure of the prosody hierarchy structure prediction model comprises a cascaded fully connected layer, N feature processing layers, and a normalization layer, N being a positive integer; each feature processing layer comprises a nonlinear sublayer and a self-attention sublayer.
8. The text prosody hierarchy structure prediction method according to any one of claims 1 to 6, characterized in that the obtaining, through a prosody hierarchy structure prediction model, a prosody hierarchy structure sequence corresponding to the word-level feature sequence comprises:
inputting the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a four-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a prosodic word boundary, a prosodic phrase boundary, an intonation phrase boundary, and a non-prosodic-structure boundary;
obtaining the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence comprising, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
9. The text prosody hierarchy structure prediction method according to any one of claims 1 to 6, characterized in that the obtaining, through a prosody hierarchy structure prediction model, a prosody hierarchy structure sequence corresponding to the word-level feature sequence comprises:
inputting the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a three-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a non-prosodic-structure boundary, a prosodic word boundary, and a prosodic phrase boundary;
obtaining the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence comprising, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
10. The text prosody hierarchy structure prediction method according to any one of claims 1 to 6, characterized in that the obtaining, through a prosody hierarchy structure prediction model, a prosody hierarchy structure sequence corresponding to the word-level feature sequence comprises:
inputting the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a two-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a prosodic phrase boundary and a non-prosodic-phrase boundary;
obtaining the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence comprising, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
11. The text prosody hierarchy structure prediction method according to claim 1, characterized in that the method further comprises:
performing speech synthesis according to the prosody hierarchy structure sequence to obtain a target voice.
12. A text prosody hierarchy structure prediction device, characterized by comprising:
an obtaining module, configured to obtain a target text;
a segmentation and part-of-speech tagging module, configured to perform word segmentation and part-of-speech tagging on the target text to obtain a segmentation and tagging sequence;
a word-level feature extraction module, configured to perform word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence, wherein the word-level feature of each word in the word-level feature sequence at least comprises a word vector obtained through semantic feature extraction;
a prosody hierarchy structure prediction module, configured to obtain, through a prosody hierarchy structure prediction model, a prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on a self-attention mechanism.
13. The text prosodic hierarchy structure prediction apparatus according to claim 12, characterized in that the word-level feature extraction module comprises:
a semantic feature extraction submodule, configured to perform semantic feature extraction on the tagged segmentation sequence to obtain the word vector corresponding to each word;
a position vector encoding submodule, configured to encode the position of each word in the target text to obtain the position vector corresponding to each word;
a word-level feature generation submodule, configured to generate the word-level feature of each word from the word vector of that word together with at least one of the corresponding position vector, part-of-speech vector, word-length vector, and post-word punctuation-type vector; and
a combination submodule, configured to combine the word-level features of the words in the tagged segmentation sequence to obtain the word-level feature sequence.
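The feature construction in claim 13 amounts to concatenating the word vector with the optional auxiliary vectors. The sketch below is hypothetical: the function name and all vector dimensions are invented for the example and do not come from the patent.

```python
import numpy as np

# Hypothetical sketch of the claimed feature generation: the word vector is
# concatenated with whichever auxiliary vectors (position, part of speech,
# word length, following-punctuation type) are provided.

def build_word_feature(word_vec, position_vec=None, pos_vec=None,
                       length_vec=None, punct_vec=None):
    parts = [word_vec] + [v for v in (position_vec, pos_vec,
                                      length_vec, punct_vec) if v is not None]
    return np.concatenate(parts)

word_vec = np.zeros(8)          # e.g. from semantic feature extraction
position_vec = np.full(4, 0.5)  # encoded position in the sentence
pos_vec = np.ones(4)            # part-of-speech embedding
feat = build_word_feature(word_vec, position_vec=position_vec, pos_vec=pos_vec)
print(feat.shape)  # → (16,)
```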
14. A text prosodic hierarchy structure prediction device, characterized in that the device comprises a processor and a memory:
the memory is configured to store a computer program; and
the processor is configured to execute the computer program to perform the text prosodic hierarchy structure prediction method according to any one of claims 1 to 11.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a computer program, the computer program being configured to perform the text prosodic hierarchy structure prediction method according to any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910834143.5A CN110534087B (en) | 2019-09-04 | 2019-09-04 | Text prosody hierarchical structure prediction method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110534087A true CN110534087A (en) | 2019-12-03 |
CN110534087B CN110534087B (en) | 2022-02-15 |
Family
ID=68667149
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910834143.5A Active CN110534087B (en) | 2019-09-04 | 2019-09-04 | Text prosody hierarchical structure prediction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110534087B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080147405A1 (en) * | 2006-12-13 | 2008-06-19 | Fujitsu Limited | Chinese prosodic words forming method and apparatus |
CN105185374A (en) * | 2015-09-11 | 2015-12-23 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy annotation method and device |
CN106601228A (en) * | 2016-12-09 | 2017-04-26 | 百度在线网络技术(北京)有限公司 | Sample marking method and device based on artificial intelligence prosody prediction |
CN107239444A (en) * | 2017-05-26 | 2017-10-10 | 华中科技大学 | A kind of term vector training method and system for merging part of speech and positional information |
CN107464559A (en) * | 2017-07-11 | 2017-12-12 | 中国科学院自动化研究所 | Joint forecast model construction method and system based on Chinese rhythm structure and stress |
CN109492583A (en) * | 2018-11-09 | 2019-03-19 | 安徽大学 | A kind of recognition methods again of the vehicle based on deep learning |
Non-Patent Citations (1)
Title |
---|
Wang Qi (王琦): "Research on Prosodic Structure Prediction Based on Deep Neural Networks" (基于深度神经网络的韵律结构预测研究), China Master's Theses Full-text Database, Information Science and Technology series * |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111191689B (en) * | 2019-12-16 | 2023-09-12 | 恩亿科(北京)数据科技有限公司 | Sample data processing method and device |
CN111191689A (en) * | 2019-12-16 | 2020-05-22 | 恩亿科(北京)数据科技有限公司 | Sample data processing method and device |
CN113129864A (en) * | 2019-12-31 | 2021-07-16 | 科大讯飞股份有限公司 | Voice feature prediction method, device, equipment and readable storage medium |
CN111243682A (en) * | 2020-01-10 | 2020-06-05 | 京东方科技集团股份有限公司 | Method, device, medium and apparatus for predicting toxicity of drug |
CN111259625A (en) * | 2020-01-16 | 2020-06-09 | 平安科技(深圳)有限公司 | Intention recognition method, device, equipment and computer readable storage medium |
CN111259625B (en) * | 2020-01-16 | 2023-06-27 | 平安科技(深圳)有限公司 | Intention recognition method, device, equipment and computer readable storage medium |
CN111292715B (en) * | 2020-02-03 | 2023-04-07 | 北京奇艺世纪科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium |
CN111292715A (en) * | 2020-02-03 | 2020-06-16 | 北京奇艺世纪科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium |
CN111259041A (en) * | 2020-02-26 | 2020-06-09 | 山东理工大学 | Scientific and technological expert resource virtualization and semantic reasoning retrieval method |
CN111339771A (en) * | 2020-03-09 | 2020-06-26 | 广州深声科技有限公司 | Text prosody prediction method based on multi-task multi-level model |
CN111339771B (en) * | 2020-03-09 | 2023-08-18 | 广州深声科技有限公司 | Text prosody prediction method based on multitasking multi-level model |
CN111524557A (en) * | 2020-04-24 | 2020-08-11 | 腾讯科技(深圳)有限公司 | Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence |
CN111524557B (en) * | 2020-04-24 | 2024-04-05 | 腾讯科技(深圳)有限公司 | Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence |
US11769480B2 (en) | 2020-06-15 | 2023-09-26 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for training model, method and apparatus for synthesizing speech, device and storage medium |
CN111667816B (en) * | 2020-06-15 | 2024-01-23 | 北京百度网讯科技有限公司 | Model training method, speech synthesis method, device, equipment and storage medium |
CN111667816A (en) * | 2020-06-15 | 2020-09-15 | 北京百度网讯科技有限公司 | Model training method, speech synthesis method, apparatus, device and storage medium |
CN111724765A (en) * | 2020-06-30 | 2020-09-29 | 上海优扬新媒信息技术有限公司 | Method and device for converting text into voice and computer equipment |
CN111951781A (en) * | 2020-08-20 | 2020-11-17 | 天津大学 | Chinese prosody boundary prediction method based on graph-to-sequence |
CN112052673A (en) * | 2020-08-28 | 2020-12-08 | 丰图科技(深圳)有限公司 | Logistics network point identification method and device, computer equipment and storage medium |
CN112183084B (en) * | 2020-09-07 | 2024-03-15 | 北京达佳互联信息技术有限公司 | Audio and video data processing method, device and equipment |
CN112183084A (en) * | 2020-09-07 | 2021-01-05 | 北京达佳互联信息技术有限公司 | Audio and video data processing method, device and equipment |
CN112131878A (en) * | 2020-09-29 | 2020-12-25 | 腾讯科技(深圳)有限公司 | Text processing method and device and computer equipment |
WO2021189984A1 (en) * | 2020-10-22 | 2021-09-30 | 平安科技(深圳)有限公司 | Speech synthesis method and apparatus, and device and computer-readable storage medium |
CN112309368A (en) * | 2020-11-23 | 2021-02-02 | 北京有竹居网络技术有限公司 | Prosody prediction method, device, equipment and storage medium |
CN112463921A (en) * | 2020-11-25 | 2021-03-09 | 平安科技(深圳)有限公司 | Prosodic hierarchy dividing method and device, computer equipment and storage medium |
CN112463921B (en) * | 2020-11-25 | 2024-03-19 | 平安科技(深圳)有限公司 | Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium |
CN112668315A (en) * | 2020-12-23 | 2021-04-16 | 平安科技(深圳)有限公司 | Automatic text generation method, system, terminal and storage medium |
WO2022142105A1 (en) * | 2020-12-31 | 2022-07-07 | 平安科技(深圳)有限公司 | Text-to-speech conversion method and apparatus, electronic device, and storage medium |
CN113096641A (en) * | 2021-03-29 | 2021-07-09 | 北京大米科技有限公司 | Information processing method and device |
CN113901210B (en) * | 2021-09-15 | 2022-12-13 | 昆明理工大学 | Method for marking verbosity of Thai and Burma characters by using local multi-head attention to mechanism fused word-syllable pair |
CN113901210A (en) * | 2021-09-15 | 2022-01-07 | 昆明理工大学 | Method for marking verbosity of Thai and Burma characters by using local multi-head attention to mechanism fused word-syllable pair |
CN114091444A (en) * | 2021-11-15 | 2022-02-25 | 北京声智科技有限公司 | Text processing method and device, computer equipment and storage medium |
CN115116428A (en) * | 2022-05-19 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Prosodic boundary labeling method, apparatus, device, medium, and program product |
CN115116428B (en) * | 2022-05-19 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Prosodic boundary labeling method, device, equipment, medium and program product |
Also Published As
Publication number | Publication date |
---|---|
CN110534087B (en) | 2022-02-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110534087A (en) | A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium | |
CN110490213B (en) | Image recognition method, device and storage medium | |
CN111368993B (en) | Data processing method and related equipment | |
CN110609891A (en) | Visual dialog generation method based on context awareness graph neural network | |
CN109344404B (en) | Context-aware dual-attention natural language reasoning method | |
CN110377916B (en) | Word prediction method, word prediction device, computer equipment and storage medium | |
CN110334354A (en) | A kind of Chinese Relation abstracting method | |
CN109271493A (en) | A kind of language text processing method, device and storage medium | |
CN108846077A (en) | Semantic matching method, device, medium and the electronic equipment of question and answer text | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN108628935A (en) | A kind of answering method based on end-to-end memory network | |
CN112288075A (en) | Data processing method and related equipment | |
CN110502610A (en) | Intelligent sound endorsement method, device and medium based on text semantic similarity | |
US11645479B1 (en) | Method for AI language self-improvement agent using language modeling and tree search techniques | |
CN114676234A (en) | Model training method and related equipment | |
CN110796160A (en) | Text classification method, device and storage medium | |
CN107679225A (en) | A kind of reply generation method based on keyword | |
CN110991290A (en) | Video description method based on semantic guidance and memory mechanism | |
CN110334196B (en) | Neural network Chinese problem generation system based on strokes and self-attention mechanism | |
CN113901191A (en) | Question-answer model training method and device | |
CN109710760A (en) | Clustering method, device, medium and the electronic equipment of short text | |
Zhou et al. | ICRC-HIT: A deep learning based comment sequence labeling system for answer selection challenge | |
CN108345612A (en) | A kind of question processing method and device, a kind of device for issue handling | |
CN114882862A (en) | Voice processing method and related equipment | |
CN111767720B (en) | Title generation method, computer and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||