CN110534087A - Text prosody hierarchy structure prediction method, apparatus, device and storage medium - Google Patents
Text prosody hierarchy structure prediction method, apparatus, device and storage medium
- Publication number
- CN110534087A CN110534087A CN201910834143.5A CN201910834143A CN110534087A CN 110534087 A CN110534087 A CN 110534087A CN 201910834143 A CN201910834143 A CN 201910834143A CN 110534087 A CN110534087 A CN 110534087A
- Authority
- CN
- China
- Prior art keywords
- word
- hierarchy structure
- prosody hierarchy
- sequence
- structure prediction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L13/00—Speech synthesis; Text to speech systems
- G10L13/08—Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
- G10L13/10—Prosody rules derived from text; Stress or intonation
Abstract
The embodiments of the present application disclose a prosody hierarchy structure prediction method, apparatus, device, and storage medium based on artificial intelligence. The method comprises: obtaining a target text; performing word segmentation and part-of-speech tagging on the target text to obtain a segmentation-and-tagging sequence; performing word-level feature extraction according to the segmentation-and-tagging sequence to obtain a word-level feature sequence, in which the word-level features of each word include at least a word vector obtained through semantic feature extraction; and obtaining, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism. The method can effectively improve the prediction accuracy for prosody hierarchy structures.
Description
Technical field
This application relates to the field of speech technology, and in particular to a text prosody hierarchy structure prediction method, apparatus, device, and storage medium based on artificial intelligence and the self-attention mechanism.
Background technique
The prosody hierarchy structure models prosodic features of speech such as pauses and rhythm. Prosody structure prediction is a front-end text-processing task in speech synthesis: it determines, from textual features, the prosodic structure type of each word in a sentence. Prosody structure prediction is of great significance to the naturalness of the speech synthesized by a text-to-speech system. At present, prosody structure prediction is mainly modeled with conditional random fields (CRF) or recurrent neural networks (RNN), but in practice the modeling performance of both schemes is limited, which in turn limits the quality of speech synthesis.
Summary of the invention
The embodiments of the present application provide a text prosody hierarchy structure prediction method, apparatus, device, and storage medium based on artificial intelligence, which can effectively improve the prediction accuracy for prosody hierarchy structures.
In view of this, a first aspect of the application provides a text prosody hierarchy structure prediction method based on artificial intelligence, comprising:
obtaining a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a segmentation-and-tagging sequence;
performing word-level feature extraction according to the segmentation-and-tagging sequence to obtain a word-level feature sequence, wherein the word-level features of each word in the word-level feature sequence include at least a word vector obtained through semantic feature extraction;
obtaining, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
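The four claimed steps can be sketched as a data-flow skeleton. Every helper passed in below is a hypothetical stub standing in for components described in the embodiments, not the patent's actual implementation:

```python
# Schematic sketch of the claimed method; all helpers are hypothetical stubs.
def predict_prosody_hierarchy(target_text, segment_and_tag, extract_features,
                              prediction_model):
    tagged = segment_and_tag(target_text)          # segmentation + POS tagging
    features = [extract_features(w, i, tagged)     # word-level feature sequence
                for i, w in enumerate(tagged)]
    return prediction_model(features)              # self-attention-based model

# Toy stand-ins just to show the data flow:
labels = predict_prosody_hierarchy(
    "hello world",
    lambda t: [(w, "x") for w in t.split()],
    lambda w, i, seq: [float(i)],
    lambda feats: ["PW" for _ in feats],           # one prosody label per word
)
print(labels)  # ['PW', 'PW']
```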
A second aspect of the application provides a text prosody hierarchy structure prediction apparatus based on artificial intelligence, comprising:
an obtaining module, configured to obtain a target text;
a segmentation and part-of-speech tagging module, configured to perform word segmentation and part-of-speech tagging on the target text to obtain a segmentation-and-tagging sequence;
a word-level feature extraction module, configured to perform word-level feature extraction according to the segmentation-and-tagging sequence to obtain a word-level feature sequence, wherein the word-level features of each word in the word-level feature sequence include at least a word vector obtained through semantic feature extraction;
a prosody hierarchy structure prediction module, configured to obtain, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
A third aspect of the application provides a text prosody hierarchy structure prediction device based on artificial intelligence, the device comprising a processor and a memory:
the memory is configured to store a computer program;
the processor is configured to execute, according to the computer program, the text prosody hierarchy structure prediction method described in the first aspect.
A fourth aspect of the application provides a computer-readable storage medium configured to store a computer program, the computer program being used to execute the text prosody hierarchy structure prediction method described in the first aspect.
A fifth aspect of the application provides a computer program product comprising instructions which, when run on a computer, cause the computer to execute the text prosody hierarchy structure prediction method described in the first aspect.
As can be seen from the above technical solutions, the embodiments of the present application have the following advantage:
The embodiments of the present application provide a text prosody hierarchy structure prediction method that uses a deep neural network model based on the self-attention mechanism to predict the prosody hierarchy structure, effectively improving the prediction accuracy for prosody hierarchy structures. Specifically, in the prosody hierarchy structure prediction method provided by the embodiments of the present application, after a target text is obtained, word segmentation and part-of-speech tagging are performed on the target text to obtain a segmentation-and-tagging sequence; word-level feature extraction is then performed according to the segmentation-and-tagging sequence to obtain a word-level feature sequence, in which the word-level features of each word include at least a word vector obtained through semantic feature extraction; finally, the prosody hierarchy structure sequence corresponding to the word-level feature sequence is obtained through a prosody hierarchy structure prediction model, which is a deep neural network model based on the self-attention mechanism. Because a deep neural network model based on the self-attention mechanism can capture the contextual dependencies between all the words within the full sentence, it has better sequence modeling ability than the CRF and RNN models of the related art; therefore, the above prediction method can effectively improve the prediction effect for prosody hierarchy structures and correspondingly improve the quality of speech synthesis.
Description of the drawings
Fig. 1 is a schematic diagram of an application scenario of the prosody hierarchy structure prediction method based on artificial intelligence provided by the embodiments of the present application;
Fig. 2 is a schematic flowchart of a prosody hierarchy structure prediction method based on artificial intelligence provided by the embodiments of the present application;
Fig. 3 is a schematic diagram of the working architecture of the prosody hierarchy structure prediction model provided by the embodiments of the present application;
Fig. 4 is a schematic flowchart of a training method for the prosody hierarchy structure prediction model provided by the embodiments of the present application;
Fig. 5 is a schematic diagram of similarity computation with scaled dot-product attention provided by the embodiments of the present application;
Fig. 6 is a schematic diagram of the computation flow of the multi-head attention mechanism provided by the embodiments of the present application;
Fig. 7 is a schematic diagram of the operation of the fully connected network sublayer provided by the embodiments of the present application;
Fig. 8 is a schematic diagram of the residual connection provided by the embodiments of the present application;
Fig. 9 is a schematic flowchart of another prosody hierarchy structure prediction method provided by the embodiments of the present application;
Figs. 10 to 14 are schematic structural diagrams of the first to fifth prosody hierarchy structure prediction apparatuses based on artificial intelligence provided by the embodiments of the present application, respectively;
Fig. 15 is a schematic structural diagram of a server for predicting prosody hierarchy structures based on artificial intelligence provided by the embodiments of the present application;
Fig. 16 is a schematic structural diagram of a terminal device for predicting prosody hierarchy structures based on artificial intelligence provided by the embodiments of the present application.
Detailed description of embodiments
In order to enable those skilled in the art to better understand the solution of the present application, the technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present application, not all of them. Based on the embodiments of the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present application.
The terms "first", "second", "third", "fourth", etc. (if present) in the description, claims, and drawings of the present application are used to distinguish similar objects, not to describe a particular order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments described herein can be implemented in an order other than the one illustrated or described here. In addition, the terms "comprising" and "having" and any variations thereof are intended to cover a non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to the process, method, product, or device.
This application relates to the field of artificial intelligence (AI); the related technologies in this field are briefly introduced below.
Artificial intelligence is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can respond in a manner similar to human intelligence.
Artificial intelligence technology is an interdisciplinary subject covering a wide range of fields, involving both hardware-level and software-level technologies. Basic AI technologies generally include sensors, dedicated AI chips, cloud computing, distributed storage, big data processing technology, operation/interaction systems, mechatronics, and the like. AI software technology mainly includes several general directions such as computer vision, speech processing, natural language processing, and machine learning/deep learning.
Among these, natural language processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics; research in this field involves natural language, that is, the language people use every day, so it is closely related to linguistic research. Natural language processing technology generally includes text processing, semantic understanding, machine translation, question answering, knowledge graphs, and other technologies.
In the related art, CRF or RNN models are generally used for modeling, and on this basis the prosody hierarchy structure of each word in a text is predicted. However, CRF and RNN models usually cannot capture the dependency between any two words within the full sentence, so their modeling ability is limited, and as a result the prosody hierarchy structure cannot be predicted accurately based on them.
In view of these problems in the related art, the embodiments of the present application provide a prosody hierarchy structure prediction method based on AI. The method uses a deep neural network model based on the self-attention mechanism to predict the prosody hierarchy structure of each word in a target text. Through its self-attention sublayers, this deep neural network model can better capture the contextual dependencies within the full sentence; compared with the CRF and RNN models of the related art, it has better sequence modeling ability and can accordingly achieve a better prosody hierarchy structure prediction effect, which in turn helps to improve the quality of subsequent speech synthesis.
It should be understood that the prosody hierarchy structure prediction method based on AI provided by the embodiments of the present application can be applied to a device with data processing capability, such as a terminal device or a server. The terminal device may specifically be a smartphone, computer, personal digital assistant (PDA), tablet computer, or the like; the server may specifically be an application server or a Web server, and in actual deployment it may be a standalone server or a server cluster.
To facilitate understanding of the technical solution provided by the embodiments of the present application, an application scenario in which the prosody hierarchy structure prediction method is applied to a server is introduced below as an example.
Referring to Fig. 1, Fig. 1 is a schematic diagram of an application scenario of the prosody hierarchy structure prediction method based on AI provided by the embodiments of the present application. As shown in Fig. 1, the application scenario includes a terminal device 110 and a server 120, which communicate over a network. The terminal device 110 is configured to receive a voice signal input by a user and transmit the voice signal to the server 120. The server 120 is configured to determine the reply text corresponding to the voice signal transmitted by the terminal device 110, execute the prosody hierarchy structure prediction method provided by the embodiments of the present application to predict the prosody hierarchy structure of each word in the reply text and generate the prosody hierarchy structure sequence corresponding to the reply text, and then convert the reply text into the corresponding reply voice signal according to the prosody hierarchy structure sequence and transmit it to the terminal device 110.
In a specific application, the user can input a voice signal to the terminal device 110, requesting the terminal device 110 to reply with a corresponding voice signal; after receiving the voice signal input by the user, the terminal device 110 transmits the voice signal to the server 120 through the network.
After the server 120 receives the voice signal transmitted by the terminal device 110, it first determines the reply text for replying to the voice signal. Then, taking the reply text as the target text, it performs word segmentation and part-of-speech tagging on the target text to obtain the corresponding segmentation-and-tagging sequence; next, it performs word-level feature extraction on the segmentation-and-tagging sequence to obtain a word-level feature sequence, in which the word-level features of each word include at least a word vector obtained through semantic feature extraction; then, it determines, through the prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
After determining the prosody hierarchy structure sequence corresponding to the reply text through the above processing, the server 120 can further generate, based on the prosody hierarchy structure sequence, the reply voice signal corresponding to the reply text; this reply voice signal is more natural and closer to human pronunciation. Finally, the server 120 transmits the generated reply voice signal to the terminal device 110, and the terminal device 110 plays the reply voice signal, realizing human-computer interaction with the user.
It should be noted that the human-computer interaction application scenario shown in Fig. 1 is merely an example. In practical applications, the prosody hierarchy structure prediction method provided by the embodiments of the present application may also be applied to other scenarios, for example, a scenario of converting text uploaded by a user into speech. The application scenarios to which the prosody hierarchy structure prediction method is applicable are not limited here.
The prosody hierarchy structure prediction method based on AI provided by the present application is introduced below through embodiments.
Referring to Fig. 2, Fig. 2 is a schematic flowchart of a prosody hierarchy structure prediction method based on AI provided by the embodiments of the present application. For ease of description, the following embodiments take a server as the execution subject. As shown in Fig. 2, the prosody hierarchy structure prediction method includes the following steps:
Step 201: Obtain a target text.
When the server needs to synthesize the voice signal corresponding to a target text, in order to obtain a voice signal that is more natural and closer to human pronunciation, the server can first predict the prosody hierarchy structure sequence corresponding to the target text, and then synthesize the voice signal corresponding to the target text based on that sequence.
It should be noted that the server can obtain the target text in different ways under different application scenarios. Taking the human-computer interaction scenario as an example, the server can obtain the voice signal sent by the terminal device and use the text obtained by converting that voice signal as the target text. In the scenario of converting text to speech, the server can use the text to be converted that is sent by the terminal device as the target text, or use text to be converted that it obtains from other servers or databases as the target text; and so on. The way in which the server obtains the target text is not limited here.
It should be understood that, when the prosody hierarchy structure prediction method provided by the embodiments of the present application is applied to a terminal device, the terminal device can also obtain the target text in different ways under different application scenarios. Taking the human-computer interaction scenario as an example, the terminal device can convert the voice signal input by the user into the corresponding text and use that text as the target text. In the scenario of converting text to speech, the terminal device can use text input by the user as the target text, or use text transmitted by a server as the target text; and so on. The way in which the terminal device obtains the target text is likewise not limited here.
Step 202: Perform word segmentation and part-of-speech tagging on the target text to obtain a segmentation-and-tagging sequence.
After obtaining the target text, the server can first perform word segmentation on the target text to obtain the segmentation sequence corresponding to the target text, and then perform part-of-speech tagging on each word in the segmentation sequence to obtain the segmentation-and-tagging sequence corresponding to the target text.
It should be noted that the related art already provides mature word segmentation and part-of-speech tagging methods. A word segmentation method of the related art can be used directly to segment the target text, and a part-of-speech tagging method of the related art can be used to tag the resulting segmentation sequence; the specific word segmentation and part-of-speech tagging methods used are not limited here.
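As a toy illustration of step 202 (not the patent's actual implementation, which defers to existing toolkits), the sketch below performs forward maximum-matching segmentation against a hypothetical mini-lexicon and tags each word by lexicon lookup:

```python
# Toy segmenter + POS tagger; the lexicon entries are hypothetical.
LEXICON = {  # word -> part-of-speech tag
    "向您": "p", "致以": "v", "诚挚": "a", "问候": "n",
    "和": "c", "美好": "a", "祝愿": "n", "。": "w",
}
MAX_WORD_LEN = max(len(w) for w in LEXICON)

def segment_and_tag(text):
    """Return [(word, pos), ...] via forward maximum matching."""
    result, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            cand = text[i:i + length]
            if cand in LEXICON or length == 1:
                result.append((cand, LEXICON.get(cand, "x")))
                i += length
                break
    return result

pairs = segment_and_tag("向您致以诚挚问候和美好祝愿。")
print(pairs)
```

A production system would replace both the lexicon and the matching strategy with a mature segmentation/tagging toolkit, as the description notes.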
Step 203: Perform word-level feature extraction according to the segmentation-and-tagging sequence to obtain a word-level feature sequence, in which the word-level features of each word include at least a word vector obtained through semantic feature extraction.
After the server obtains the segmentation-and-tagging sequence corresponding to the target text, it can perform word-level feature extraction on each word in the sequence, and then combine the word-level features of the words in order to obtain the word-level feature sequence corresponding to the target text. Here, the word-level features of each word include at least the word vector obtained through semantic feature extraction.
It should be noted that, in practical applications, in order to further improve the prediction effect for prosody hierarchy structures, the word-level features of each word may also include, in addition to the word vector obtained through semantic feature extraction, at least one of the position vector, part-of-speech vector, word-length vector, and post-word punctuation vector corresponding to the word. Enriching the word-level features of each word in this way ensures that the subsequent prediction of the prosody hierarchy structure based on these rich word-level features can be more accurate.
In a specific implementation, the server can perform semantic feature extraction on the segmentation-and-tagging sequence to obtain the word vector corresponding to each word; encode the position information of each word in the text to obtain the position vector corresponding to each word; then generate the word-level features of each word according to its word vector and at least one of its position vector, part-of-speech vector, word-length vector, and post-word punctuation-type vector; and finally combine the word-level features of the words in the segmentation-and-tagging sequence to obtain the word-level feature sequence.
When extracting semantic features, the server can perform semantic feature extraction on the segmentation-and-tagging sequence through a semantic feature extraction model to obtain the word vector corresponding to each word; that is, the server can input each word in the segmentation-and-tagging sequence into the semantic feature extraction model one by one, to obtain the word vector of each word output by the semantic feature extraction model.
It should be noted that, in order to make the word vectors obtained through semantic feature extraction rich in semantic features, the server can use a pre-trained Bidirectional Encoder Representations from Transformers (BERT) network structure or a Skip-Gram network structure as the semantic feature extraction model. Such a semantic feature extraction model (i.e., the BERT or Skip-Gram network structure) is pre-trained on a large corpus, so the extracted word vectors are generally rich in contextual information and have good semantic features; performing subsequent prosody hierarchy structure prediction based on these word vectors can help improve the prediction effect.
It should be understood that, in practical applications, the server can also use other network structures as the semantic feature extraction model; the model structure of the semantic feature extraction model is not specifically limited here.
Although the prosody hierarchy structure prediction model based on the self-attention mechanism can learn the dependency between words at any distance within the full sentence, the self-attention mechanism itself ignores the relative positional distance between words. In order to enable the model to subsequently make use of relative position information, the present application adopts the timing signal mechanism to encode the position information of each word in the target text into the position vector corresponding to the word. The position information can be encoded directly with formulas (1) and (2), without learning any parameters:
PE(t, 2i) = sin(t / 10000^(2i/d))    (1)
PE(t, 2i+1) = cos(t / 10000^(2i/d))    (2)
where t is the index of the time step, 2i and 2i+1 are the dimension indices of the encoding, and d is the dimension of the position encoding.
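The timing-signal encoding can be computed in a few lines. The sketch below assumes the standard sinusoidal form whose variables (t, 2i, 2i+1, d) match the description above; the exact form used in the patent may differ:

```python
import math

def position_vector(t, d):
    """Position vector of dimension d for time step t (no learned params)."""
    pe = [0.0] * d
    for i in range(0, d, 2):
        angle = t / (10000 ** (i / d))
        pe[i] = math.sin(angle)            # even dimensions: sine
        if i + 1 < d:
            pe[i + 1] = math.cos(angle)    # odd dimensions: cosine
    return pe

print(position_vector(0, 4))  # t = 0 gives [0.0, 1.0, 0.0, 1.0]
```

Because nearby time steps yield similar vectors and the wavelength grows with the dimension index, the model can recover relative distances between words from these vectors alone.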
The part-of-speech vector indicates the part of speech of a word, for example a vector indicating that the part of speech is a noun, or a vector indicating that the part of speech is a verb. The word-length vector indicates the number of characters contained in the word, for example a vector indicating that the word contains two characters, or a vector indicating that the word contains three characters. The post-word punctuation-type vector indicates whether there is punctuation after the word and, if so, the type of that punctuation: if there is punctuation after the word, the vector indicates the type of the punctuation following the word; if there is no punctuation after the word, the vector indicates that no punctuation follows the word. The part-of-speech vector, word-length vector, and post-word punctuation-type vector can each be represented as a one-hot vector.
Take as an example a target text meaning "We extend sincere greetings and good wishes." For the last word of the text, "wishes", the corresponding part-of-speech vector is a vector indicating that the part of speech is a noun, the word-length vector is a vector indicating that the word contains two characters (in the original Chinese text), and the post-word punctuation-type vector is a vector indicating that the punctuation after the word is a full stop.
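The three categorical features of the example word can be one-hot encoded as follows. The category inventories (POS tags, length buckets, punctuation types) are hypothetical placeholders, not the patent's actual inventories:

```python
# Toy one-hot encoding of the three categorical word features.
POS_TAGS = ["noun", "verb", "adjective", "conjunction", "other"]
LENGTHS = [1, 2, 3, 4]                     # word length in characters
PUNCT = ["none", "comma", "full_stop"]     # punctuation after the word

def one_hot(value, inventory):
    vec = [0] * len(inventory)
    vec[inventory.index(value)] = 1
    return vec

# Features of the last word, "wishes", in the example sentence:
pos_vec = one_hot("noun", POS_TAGS)        # part of speech: noun
len_vec = one_hot(2, LENGTHS)              # two characters
punct_vec = one_hot("full_stop", PUNCT)    # followed by a full stop
print(pos_vec, len_vec, punct_vec)
```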
It should be understood that, in practical applications, the server can generate, as needed, at least one of the above position vector, part-of-speech vector, word-length vector, and post-word punctuation-type vector for each word in the segmentation-and-tagging sequence, and then combine the generated vector(s) with the word vector of each word to generate the word-level features of the word.
In one possible implementation, in the case where the server generates the position vector, part-of-speech vector, word-length vector, and post-word punctuation-type vector for each word, the server may, for each word in the segmentation-and-tagging sequence, sum the word vector and the position vector corresponding to the word, and then concatenate the sum with the word's part-of-speech vector, word-length vector, and post-word punctuation-type vector to obtain the word-level feature corresponding to the word.
For example, suppose the server performs semantic feature extraction on the i-th word in the segmentation-and-tagging sequence using the semantic feature extraction model and obtains its corresponding word vector e_i. The server sums e_i with the position vector obtained by encoding the position information of the i-th word, and then concatenates the sum with the text-feature set r_i, where r_i is formed by concatenating the part-of-speech vector, word-length vector, and post-word punctuation-type vector of the i-th word.
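By way of illustration only, the assembly of a single word-level feature described above may be sketched as follows; the dimensions and example one-hot values here are assumptions for demonstration, not values taken from the present application:

```python
import numpy as np

def word_level_feature(e_i, p_i, pos_vec, len_vec, punct_vec):
    """Sum the word vector e_i with the position vector p_i, then
    concatenate the sum with the text-feature set r_i, where r_i is the
    concatenation of the one-hot part-of-speech, word-length, and
    post-word punctuation-type vectors."""
    r_i = np.concatenate([pos_vec, len_vec, punct_vec])
    return np.concatenate([e_i + p_i, r_i])

# Toy example: a 4-dim word/position vector plus three small one-hot features.
feat = word_level_feature(
    e_i=np.array([0.1, 0.2, 0.3, 0.4]),
    p_i=np.array([0.0, 0.1, 0.0, 0.1]),
    pos_vec=np.array([0, 1, 0]),   # e.g. part of speech is a noun
    len_vec=np.array([0, 1]),      # e.g. the word contains two characters
    punct_vec=np.array([1, 0]),    # e.g. a full stop follows the word
)
```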
In this way, factors closely related to the prosody hierarchy structure type are taken into account during prediction, enriching the semantic features referenced in prosody hierarchy structure prediction and thereby helping ensure the accuracy of that prediction.
After the word-level feature corresponding to each word in the segmentation-and-tagging sequence is obtained through the above processing, the word-level features are combined according to the order of the words in the sequence, yielding the word-level feature sequence corresponding to the target text.
Step 204: obtain the prosody hierarchy structure sequence corresponding to the word-level feature sequence by means of a prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a deep neural network model based on a self-attention mechanism.
After generating the word-level feature sequence corresponding to the target text, the server further processes the sequence with the prosody hierarchy structure prediction model to obtain the prosody hierarchy structure sequence corresponding to the target text; the prediction model is a deep neural network model based on a self-attention mechanism.
In one possible implementation, the network structure of the above prosody hierarchy structure prediction model includes a cascaded fully connected layer, N feature-processing layers (N being a positive integer), and a normalization layer, where each feature-processing layer specifically includes a nonlinear sublayer and a self-attention sublayer.
Referring to Fig. 3, Fig. 3 is a schematic architecture diagram of an illustrative prosody hierarchy structure prediction model provided by the embodiments of the present application. The word-level feature sequence input to the model may be denoted W = (w_1, w_2, …, w_i, …, w_n), where w_i, the word-level feature of the i-th word in the sequence, is generated as follows: a pre-trained semantic feature extraction model performs semantic feature extraction on the i-th word to obtain a word vector; the word vector is summed with the position vector; and the sum is concatenated with the vector formed by concatenating the part-of-speech vector, word-length vector, and post-word punctuation-type vector, yielding the word-level feature of the i-th word.
The fully connected layer at the front end of the prosody hierarchy structure prediction model mixes the features of the input word-level feature sequence to obtain a higher-level feature representation; N identical feature-processing layers are then stacked to form a deep network, each feature-processing layer consisting of a nonlinear sublayer and a self-attention sublayer, with a residual connection structure between the input and output of each sublayer and layer normalization applied to each sublayer's output; the final layer, a normalization layer (i.e., a softmax layer), outputs the probability distribution over prosody structure types for each word in the target text.
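The stacked structure described above may be sketched as follows; the toy sublayers, dimensions, and random parameters are illustrative assumptions, not trained values:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def forward(W, fc, blocks, out_proj):
    """W: (n_words, d_in) word-level feature sequence.
    fc is the front-end fully connected mixing layer; each block is a
    (self-attention sublayer, nonlinear sublayer) pair wrapped in residual
    connections and layer normalization; the final softmax layer outputs a
    probability distribution over prosody structure types per word."""
    h = W @ fc
    for self_attn, nonlinear in blocks:
        h = layer_norm(h + self_attn(h))   # residual connection + layer norm
        h = layer_norm(h + nonlinear(h))
    return softmax(h @ out_proj)           # (n_words, n_types)

rng = np.random.default_rng(0)
d = 8
identity_block = (lambda x: x * 0.0, lambda x: x * 0.0)  # placeholder sublayers
probs = forward(rng.normal(size=(5, 6)), rng.normal(size=(6, d)),
                [identity_block] * 2, rng.normal(size=(d, 4)))
```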
It should be noted that, in practical applications, the value of N may be set according to actual needs; no specific limitation is imposed on the value of N here.
Each layer of the network structure of the prosody hierarchy structure prediction model shown in Fig. 3 will be described in detail when the training method of the model is introduced below; refer to the related content in the training method of the prosody hierarchy structure prediction model hereinafter, which is not repeated here.
It should be understood that, in practical applications, other network structures may be adopted for the prosody hierarchy structure prediction model according to actual needs; the model shown in Fig. 3 is merely illustrative, and the present application does not impose any limitation on the specific structure of the prosody hierarchy structure prediction model.
In one possible implementation, the server inputs the word-level feature sequence into the prosody hierarchy structure prediction model, the model being a four-class model used to predict, for each word in the text, the probability of belonging to a non-prosodic-structure boundary (not a boundary, NB), a prosodic word boundary (prosodic word, PW), a prosodic phrase boundary (prosodic phrase, PPH), or an intonational phrase boundary (intonational phrase, IPH); the server thereby obtains the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the sequence including each word together with the prosody hierarchy structure type identifier of highest probability for that word.
Specifically, after the server inputs the word-level feature sequence into the prosody hierarchy structure prediction model, the model correspondingly predicts the probability of each word in the target text belonging to NB, PW, PPH, and IPH. The prosody hierarchy structure type identifier of highest probability is then determined for each word, this identifier characterizing the prosody hierarchy structure corresponding to the word; arranging the identifiers of the words in order yields the prosody hierarchy structure sequence corresponding to the input word-level feature sequence.
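The selection of the highest-probability type identifier for each word may be sketched as a simple argmax over the per-word probabilities; the ordering of the four class labels below is an assumption for illustration:

```python
TYPES = ["NB", "PW", "PPH", "IPH"]  # assumed ordering of the four classes

def decode(prob_rows):
    """For each word, pick the prosody hierarchy structure type of highest
    probability and arrange the identifiers in word order."""
    return [TYPES[max(range(len(row)), key=row.__getitem__)] for row in prob_rows]

seq = decode([
    [0.1, 0.2, 0.6, 0.1],   # word 1 -> PPH
    [0.7, 0.1, 0.1, 0.1],   # word 2 -> NB
    [0.1, 0.1, 0.1, 0.7],   # word 3 -> IPH
])
```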
In another possible implementation, the server inputs the word-level feature sequence into the prosody hierarchy structure prediction model, the model being a three-class model used to predict, for each word in the text, the probability of belonging to a non-prosodic-structure boundary (NB), a prosodic word boundary (PW), or a prosodic phrase boundary (PPH); the server thereby obtains the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the sequence including each word together with the prosody hierarchy structure type identifier of highest probability for that word.
Specifically, after the server inputs the word-level feature sequence into the prosody hierarchy structure prediction model, the model correspondingly predicts the probability of each word in the target text belonging to NB, PW, and PPH. The prosody hierarchy structure type identifier of highest probability is then determined for each word, this identifier characterizing the prosody hierarchy structure corresponding to the word; arranging the identifiers in order yields the prosody hierarchy structure sequence corresponding to the input word-level feature sequence.
It should be noted that, when the prosody hierarchy structure prediction model is a three-class model, in addition to predicting the probability of a word in the target text belonging to NB, PW, or PPH, it may instead predict the probability of a word belonging to PW, IPH, or PPH, or indeed to any other three prosody hierarchy structures; no limitation is imposed here on which three prosody hierarchy structures a three-class prediction model may predict.
In yet another possible implementation, the server may input the word-level feature sequence into the prosody hierarchy structure prediction model, the model being a two-class model used to predict, for each word in the target text, the probability of belonging to a prosodic phrase boundary (PPH) or a non-prosodic-phrase boundary; the server thereby obtains the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the sequence including each word together with the identifier of the prosody hierarchy structure type of highest probability for that word.
Specifically, after the server inputs the word-level feature sequence into the prosody hierarchy structure prediction model, the model correspondingly predicts the probability of each word in the target text belonging to PPH and non-PPH. When the type identifier of highest probability for a word is PPH, the word is characterized as belonging to PPH; when it is non-PPH, the word is characterized as belonging to non-PPH. Arranging the identifiers of the words in order yields the prosody hierarchy structure sequence corresponding to the input word-level feature sequence.
It should be noted that, when the prosody hierarchy structure prediction model is a two-class model, in addition to predicting the probability of a word in the target text belonging to PPH or non-PPH, it may instead predict the probability of a word belonging to IPH or non-IPH, or to PW or non-PW; no limitation is imposed here on which two prosody hierarchy structures a two-class prediction model may predict.
After the processing of steps 201 to 204, the server obtains the prosody hierarchy structure sequence corresponding to the target text; the server may then perform speech synthesis processing according to this prosody hierarchy structure sequence, a target timbre, a target speaking rate, a target volume, and a target sampling rate, to obtain the target speech corresponding to the target text.
It should be understood that the above target timbre, target speaking rate, target volume, and target sampling rate may be set personally by the user, or may be default parameters of the speech synthesis system; no limitation is imposed here on the manner of setting or the specific values of the target timbre, target speaking rate, target volume, and target sampling rate.
The above AI-based prosody hierarchy structure prediction method uses a deep neural network model based on a self-attention mechanism to predict the prosody hierarchy structure of each segmented word in the target text. Through its self-attention sublayers, the model can better capture context dependencies across the full sentence; compared with the CRF and RNN models of the related art, this deep neural network model has better sequence-modeling ability and, correspondingly, achieves a better prosody hierarchy structure prediction effect, which in turn helps improve the quality of subsequent speech synthesis.
It should be understood that, in practical applications, whether the AI-based prosody hierarchy structure prediction method provided by the embodiments of the present application can accurately predict the prosody hierarchy structure corresponding to a target text depends primarily on the model performance of the prosody hierarchy structure prediction model, and that performance is closely related to the model's training process. The training method of the prosody hierarchy structure prediction model provided by the present application is introduced below through embodiments.
Referring to Fig. 4, Fig. 4 is a flow diagram of the training method of the prosody hierarchy structure prediction model provided by the embodiments of the present application. For ease of description, the following embodiments take a server as the executing subject when introducing the training method. As shown in Fig. 4, the training method of the prosody hierarchy structure prediction model includes the following steps:
Step 401: obtain a training sample set, the training sample set including training samples and the prosody hierarchy structure label corresponding to each training sample.
Before the prosody hierarchy structure prediction model is trained, it is usually necessary to obtain a large number of training samples, each with a corresponding prosody hierarchy structure label, so as to form the training sample set used for training the prosody hierarchy structure prediction model.
It should be noted that the prosody hierarchy structure labels annotated for the training samples are closely related to the type of prosody hierarchy structure prediction model to be trained. When the model is a four-class model predicting the probability of each word in a text belonging to PW, PPH, IPH, or NB, the labels annotated by the server for the training samples should include the four prosody hierarchy structure type identifiers PW, PPH, IPH, and NB; when the model is a three-class model predicting the probability of each word belonging to NB, PW, or PPH, the labels should include the three identifiers NB, PW, and PPH; when the model is a two-class model predicting the probability of each word belonging to PPH or non-PPH, the labels should include the two identifiers PPH and non-PPH; and so on.
Optionally, in order to let the prosody hierarchy structure prediction model learn a certain degree of uncertainty, which helps improve the model's generalization ability and its prediction effect, the server may apply label smoothing to the prosody hierarchy structure label corresponding to each training sample in the training sample set.
Taking labels covering the four types PW, PPH, IPH, and NB as an example, suppose a given word belongs to IPH; the prosody hierarchy structure label corresponding to that word is then expressed as the following one-hot vector:
TAG_IPH = (0, 0, 0, 1)
Applying label smoothing to this prosody hierarchy structure label adds noise, introducing a certain degree of uncertainty. Assuming the smoothing value is set to 0.1, the label vector after label smoothing is expressed as follows:
SMOOTH_IPH = (0.03, 0.03, 0.03, 0.9)
In this way, before the prosody hierarchy structure prediction model is trained, the prosody hierarchy structure labels corresponding to all training samples may be smoothed, allowing the model to learn a certain degree of uncertainty during training.
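Under the assumption that the true class keeps 1 − ε and the remaining probability mass ε is spread evenly over the other classes (which reproduces the numbers in the example above; the rounding is only to mirror that example), the smoothing may be sketched as:

```python
def smooth_label(one_hot, eps=0.1):
    """Replace the 1 in a one-hot label with 1 - eps and each 0 with
    eps / (K - 1), where K is the number of classes."""
    k = len(one_hot)
    return [round(1 - eps if v == 1 else eps / (k - 1), 2) for v in one_hot]

smoothed = smooth_label([0, 0, 0, 1])  # label for IPH in a four-class model
```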
Step 402: perform parameter training on a deep neural network model based on a self-attention mechanism using the training sample set, and use the trained self-attention-based deep neural network model as the prosody hierarchy structure prediction model.
After obtaining the training sample set, the server may use it to perform parameter training on a pre-constructed deep neural network model based on a self-attention mechanism, until training yields a self-attention-based deep neural network model that meets the training termination condition; this model is then used as the prosody hierarchy structure prediction model and may be put into practical application.
Taking the case where the pre-constructed self-attention-based deep neural network model has the model structure shown in Fig. 3 as an example, the training method of the prosody hierarchy structure prediction model is introduced below. First, the self-attention sublayer, the nonlinear sublayer, and the residual connection manner in this self-attention-based deep neural network model are each introduced:
The attention mechanism may be regarded as obtaining the representation of a query (query) from the query and a series of key (key)-value (value) pairs. The specific processing is as follows: compute the similarity between the query and each key to obtain a series of weights, and then take the weighted sum of the corresponding values to obtain the representation of the query. When computing similarity, common similarity calculation methods such as additive attention or dot-product attention may be adopted; below, scaled dot-product attention (Scaled dot-product attention), a form of dot-product attention, is taken as an example to introduce the similarity calculation process.
Referring to Fig. 5, Fig. 5 is a flow diagram of computing similarity using the scaled dot-product attention mechanism, where Q is the query sequence, K is the series of keys, and V is the values corresponding to the keys. As shown in Fig. 5, Q and K first undergo matrix multiplication, the result is scaled by the scaling factor and the scaled result is normalized, and the output is finally obtained through matrix multiplication with V. The specific calculation process can be expressed as formula (3):
Attention(Q, K, V) = softmax(QK^T / √d_k)V    (3)
where √d_k is the scaling factor and d_k is the dimension of the key vectors.
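Formula (3) may be sketched directly in code as follows; the dimensions used here are illustrative:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # similarities -> weights
    return weights @ V                         # weighted sum of values

rng = np.random.default_rng(0)
out = scaled_dot_product_attention(rng.normal(size=(3, 4)),
                                   rng.normal(size=(5, 4)),
                                   rng.normal(size=(5, 2)))
```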
The self-attention mechanism, meanwhile, needs only a single sequence to compute the representation of that sequence. Multi-head attention applies h linear transformations to the queries, keys, and values, and then performs scaled dot-product attention in parallel; each scaled dot-product yields a d_v-dimensional representation, and the h d_v-dimensional values are concatenated into an h·d_v-dimensional vector to form one output. The specific calculation flow of the multi-head attention mechanism is shown in Fig. 6.
The calculation formulas of the multi-head attention mechanism are shown in formula (4) and formula (5):
head_i = Attention(QW_i^Q, KW_i^K, VW_i^V)    (4)
MultiHead(Q, K, V) = Concat(head_1, …, head_h)W^O    (5)
where W_i^Q ∈ R^{d×d_k}, W_i^K ∈ R^{d×d_k}, and W_i^V ∈ R^{d×d_v} are the linear transformation matrices of the query, key, and value respectively, and W^O ∈ R^{hd_v×d} is the matrix of the final linear transformation applied after concatenating the scaled dot-product outputs. In the prosody hierarchy structure prediction model provided by the embodiments of the present application, the number of heads may be set to 8, and for each head the parameters may be set to d = 256 and d_k = d_v = d/h = 64; of course, in practical applications the above parameters may also be set according to actual needs and are not specifically limited here.
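Formulas (4) and (5) may be sketched as follows; the head count and dimensions here are scaled down from the h = 8, d = 256 configuration purely for illustration:

```python
import numpy as np

def softmax(x):
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(K.shape[-1])) @ V

def multi_head(Q, K, V, WQ, WK, WV, WO):
    """head_i = Attention(Q WQ[i], K WK[i], V WV[i]);
    MultiHead = Concat(head_1, ..., head_h) WO."""
    heads = [attention(Q @ WQ[i], K @ WK[i], V @ WV[i]) for i in range(len(WQ))]
    return np.concatenate(heads, axis=-1) @ WO

rng = np.random.default_rng(0)
h, d, dk = 2, 8, 4                      # scaled-down stand-ins for h=8, d=256, dk=64
X = rng.normal(size=(5, d))             # self-attention: Q = K = V = X
WQ = rng.normal(size=(h, d, dk)); WK = rng.normal(size=(h, d, dk))
WV = rng.normal(size=(h, d, dk)); WO = rng.normal(size=(h * dk, d))
Y = multi_head(X, X, X, WQ, WK, WV, WO)
```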
The present application has conducted further exploratory research on the self-attention mechanism and applies it to prosody hierarchy structure prediction. Specifically, the self-attention mechanism is realized in the prosody hierarchy structure prediction model through the self-attention sublayer, which is mainly used to capture the context dependencies between words across the full sentence and form word-level feature representations rich in contextual information. Higher-level prosodic structure may depend on words far apart, and the self-attention mechanism is adopted chiefly to capture the dependencies between distant words, thereby helping improve the prediction effect of the prosody hierarchy structure. Suppose a sentence has T words; computing the feature representation of the last word requires the semantic features of all words in the sentence: their similarities are computed to obtain a weight for each word, and the feature representation of the word is then obtained as the weighted sum.
Compared with the CRF and RNN models of the related art, the self-attention mechanism can directly capture the dependency between the sentence-initial word and the sentence-final word, and is insensitive to the distance between two words; CRF and RNN models, by contrast, need T−1 computation steps to form the feature input of the last word before learning the dependency between the sentence-initial and sentence-final words, and moreover, after so many recurrent computations, the complete information of the sentence-initial word cannot be guaranteed to remain by the time the last word is reached. A deep neural network model based on a self-attention mechanism is therefore more conducive than CRF or RNN models to learning dependencies between distant words; and the prediction of intonational phrase boundaries in prosody hierarchy structure prediction often depends on a distant preceding intonational phrase boundary. The self-attention mechanism depends on the input of every word in the full sentence, and its manner of calculation is more conducive to learning the structural information of the sentence as a whole.
To adapt to different usage demands, the nonlinear sublayer may be a fully connected network sublayer or a recurrent neural network sublayer. Specifically, when faster training speed is pursued, a fully connected network sublayer may be adopted as the nonlinear sublayer; when higher prediction accuracy is pursued, a recurrent neural network sublayer may be adopted. The two types of nonlinear sublayer are introduced separately below:
When the nonlinear sublayer is a fully connected network sublayer, as shown in Fig. 7, the fully connected network sublayer may be used in combination with the self-attention network sublayer to apply a nonlinear transformation to the input; it mainly performs two linear transformations, with the intermediate layer using a rectified linear unit (Rectified Linear Unit, ReLU) activation function. The specific calculation process is shown in formula (6):
FFN(X) = ReLU(XW_1)W_2    (6)
where W_1 ∈ R^{d×d} and W_2 ∈ R^{d×d} are the parameters to be learned when training the fully connected network sublayer.
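Formula (6) may be sketched as follows, with an illustrative stand-in dimension in place of d = 256:

```python
import numpy as np

def ffn(X, W1, W2):
    """FFN(X) = ReLU(X W1) W2: two linear transformations with a
    rectified linear unit activation in between."""
    return np.maximum(X @ W1, 0.0) @ W2

rng = np.random.default_rng(0)
d = 4                                    # illustrative stand-in for d = 256
X = rng.normal(size=(3, d))
Y = ffn(X, rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```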
When the nonlinear sublayer is a recurrent neural network sublayer: since the input of prosody hierarchy structure prediction is a word-level feature sequence and the output is the corresponding prosody hierarchy structure sequence, the task is in fact a mapping from sequence to sequence. Although RNNs are suited to sequence modeling, when the sequence is long, RNN training becomes difficult because of gradient explosion or gradient vanishing; gated RNNs are a relatively good method of alleviating this training problem, chiefly the long short-term memory (Long Short-Term Memory, LSTM) unit and its variant, the GRU unit. Compared with the LSTM unit, the GRU unit has more concise gates and fewer parameters, so the model converges faster. Since a unidirectional RNN carries contextual information in only one direction, a bidirectional RNN is needed to obtain bidirectional contextual information.
Although a gated RNN can learn dependencies across past time steps, it can only learn information in one direction, which restricts its performance; a bidirectional RNN enables the network to learn context dependencies in both directions. Therefore, a bidirectional RNN network structure can be applied in the prosody hierarchy structure prediction model of the present application, and accordingly the RNN nonlinear sublayer may have the following configurations:
1. a unidirectional GRU-RNN sublayer;
2. a bidirectional LSTM-RNN sublayer, i.e., bidirectional long short-term memory (Bidirectional Long Short-Term Memory, BLSTM);
3. a bidirectional GRU-RNN sublayer, i.e., bidirectional gated recurrent unit (Bidirectional Gated Recurrent Unit, BGRU).
Because residual connections exist between the inputs and outputs of the sublayers, the input dimension and output dimension of the data must be kept equal. Therefore, when the unidirectional GRU-RNN sublayer is used, the number of neurons may be set to 256; when a BLSTM or BGRU sublayer is used, the number of neurons in each direction is set to 128, and the bidirectional outputs are concatenated to form 256 dimensions.
As the number of layers increases, a deep neural network model may see its accuracy on the training set saturate or even decline; this is the degradation problem of neural network models. Residual connection is an effective method for training deep neural network models. In its specific implementation, residual connections exist between the sublayers inside each feature-processing layer, and an element-wise add operation is performed on each dimension at the junction; the specific operation process is shown in Fig. 8.
In each sublayer of the prosody hierarchy structure prediction model of the present application that uses residual connections, the calculation process can be expressed by formula (7):
Y = X + SubLayer(X)    (7)
where X and Y denote the input and the output of the sublayer, respectively.
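Formula (7), together with the layer normalization applied after the residual connection, may be sketched as follows; the toy sublayer is an illustrative assumption:

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def residual_sublayer(X, sublayer):
    """Y = X + SubLayer(X), followed by layer normalization; the input and
    output dimensions of the sublayer must match for the addition."""
    return layer_norm(X + sublayer(X))

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
Y = residual_sublayer(X, lambda h: h * 0.5)   # toy sublayer for illustration
```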
After residual error connection, it can also further pass through layer standardized operation, to control distribution between layers;This Shen
Please in prosody hierarchy structure prediction model need to stack repeatedly identical characteristic processing layer, increase with number is stacked, mould
The increase of moldeed depth degree also brings along and is difficult to trained problem, connected by residual error, can aid in prosody hierarchy knot in the application
The training of structure prediction model is more favorable to attempt deeper network structure configuration.
It should be understood that the self-attention-based deep neural network model to be trained may have model structures other than that shown in Fig. 3; the present application imposes no limitation here on the structure of the self-attention-based deep neural network model to be trained.
When specifically judging whether a trained deep neural network model meets the training termination condition, a test sample may be used to verify a first model, the first model being the model obtained by performing a first round of training on the deep neural network model using the training samples of the training sample set. Specifically, the server inputs the test sample into the first model, which correspondingly processes the input test sample to obtain the prosody hierarchy structure corresponding to the test sample; then, according to the annotated prosody hierarchy structure corresponding to the test sample and the result output by the first model, the server determines the prediction accuracy of the first model. When the prediction accuracy is greater than a preset threshold, the working performance of the first model may be considered to meet the demand; the first model may then be determined to be a deep neural network model meeting the training termination condition, and this deep neural network model may be used as the prosody hierarchy structure prediction model.
Moreover, when judging whether the above deep neural network model meets the training termination condition, the multiple models obtained over multiple rounds of training may also be used to determine whether to continue training the deep neural network model, so as to obtain the prosody hierarchy structure prediction model of optimal working performance. Specifically, test samples may be used to verify the multiple deep neural network models obtained over the rounds of training. If it is determined that the gaps between the prediction accuracies of the models obtained in the respective rounds are small, the model performance of the deep neural network model is considered to have no further room for improvement, and the deep neural network model with the highest prediction accuracy may be selected as the prosody hierarchy structure prediction model meeting the training termination condition. If it is determined that the gaps between the prediction accuracies of the models obtained in the respective rounds are large, the performance of the deep neural network model is considered to still have room for improvement, and training of the deep neural network model continues until the most stable, best-performing deep neural network model is obtained and used as the prosody hierarchy structure prediction model.
The above training method of the prosody hierarchy structure prediction model uses the obtained training sample set to perform parameter training on a pre-constructed deep neural network model based on a self-attention mechanism, and then puts the trained self-attention-based deep neural network model into practical application as the prosody hierarchy structure prediction model. Through its self-attention sublayers, this self-attention-based deep neural network model can better capture context dependencies across the full sentence; compared with the CRF and RNN models of the related art, it has better sequence-modeling ability and, correspondingly, can obtain a better prosody hierarchy structure prediction effect, which in turn helps improve the quality of subsequent speech synthesis.
To facilitate further understanding of the AI-based prosody hierarchy structure prediction method provided by the embodiments of the present application, an overall exemplary introduction to the method is given below, taking as an example the application scenario of synthesizing target speech from a target text transmitted by a user, with the prosody hierarchy structure prediction model being a four-class model. Referring to Fig. 9, Fig. 9 is a flow diagram of the AI-based prosody hierarchy structure prediction method.
When the user needs to synthesize the target text "showing sincere greetings and good wishes" into its corresponding target speech, the user may input "showing sincere greetings and good wishes" into a terminal device, so as to transmit the target text to the server through the terminal device. After the server obtains the target text, it first performs word segmentation and part-of-speech tagging on the target text to obtain the segmentation-and-tagging sequence corresponding to the target text.
Then, the server performs word-level feature extraction on the segmentation and tagging sequence corresponding to the target text to obtain the corresponding word-level feature sequence. Specifically, the server may use a BERT network structure to perform semantic feature extraction on the segmentation and tagging sequence, obtaining the word vector corresponding to each word in the sequence; it also encodes the position information of each word in the target text, obtaining the position vector corresponding to each word. Then, for each word in the segmentation and tagging sequence, the server sums the word's word vector and position vector, and concatenates this sum with the word's part-of-speech vector, word-length vector, and post-word punctuation-type vector, obtaining the word-level feature corresponding to that word. Finally, the word-level features of the words are combined in the order in which the words appear in the segmentation and tagging sequence, yielding the word-level feature sequence.
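The feature-construction step above (summing the word vector with the position vector, then concatenating the part-of-speech, word-length, and post-word punctuation-type vectors) can be sketched as follows. The dimensions and all vector values are toy assumptions for illustration; a real system would obtain the word vectors from BERT and use learned embeddings for the other features.

```python
# Sketch of word-level feature construction, assuming toy vectors:
# sum the semantic word vector and the position vector, then concatenate
# the part-of-speech, word-length, and post-word punctuation-type vectors.

def build_word_feature(word_vec, pos_enc, pos_tag_vec, length_vec, punct_vec):
    # Element-wise sum of the word vector and the position vector.
    summed = [w + p for w, p in zip(word_vec, pos_enc)]
    # Concatenate the sum with the remaining per-word vectors.
    return summed + pos_tag_vec + length_vec + punct_vec

def build_feature_sequence(words):
    # Each entry of `words` holds the five per-word vectors; features are
    # combined in the order the words appear in the tagged sequence.
    return [build_word_feature(*w) for w in words]

# Two-word toy sequence with 4-dim word/position vectors.
seq = [
    ([0.1, 0.2, 0.3, 0.4], [0.0, 0.1, 0.0, 0.1], [1, 0], [0, 1], [1, 0, 0]),
    ([0.5, 0.5, 0.5, 0.5], [0.1, 0.0, 0.1, 0.0], [0, 1], [1, 0], [0, 1, 0]),
]
features = build_feature_sequence(seq)
```

Each resulting feature keeps the summed semantic-plus-position part at a fixed offset, so downstream layers can treat the concatenated tail as categorical side information.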
Next, the server inputs the generated word-level feature sequence into the prosody hierarchy structure prediction model, which is a deep neural network model based on the self-attention mechanism. The model processes the sequence and generates, for each word in the target text, the probabilities that the word belongs to NB, PW, PPH, and IPH. For each word in the target text, the prosody hierarchy structure identifier with the highest probability is then determined as that word's prosody hierarchy structure. For example, if for "showing" the probability of belonging to PPH is highest, it can be determined that the prosody hierarchy structure corresponding to "showing" is PPH; likewise, if for "sincere" the probability of belonging to NB is highest, it can be determined that the prosody hierarchy structure corresponding to "sincere" is NB; and so on. Finally, the determined prosody hierarchy structure identifiers of the words in the target text are arranged in order, so that the prosody hierarchy structure sequence corresponding to the target text is obtained as "show<PPH>sincere<NB><PW>greet<PW>wish of<IPH>and<PW>fine<NB><IPH>".
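The per-word decision described above — take, for each word, the label with the highest probability among NB, PW, PPH, and IPH — reduces to a softmax over the model's four output scores followed by an argmax. The logit values below are hypothetical, not outputs of the embodiment's model.

```python
import math

# Four prosody classes of the embodiment: non-boundary, prosodic word,
# prosodic phrase, and intonation phrase boundaries.
LABELS = ["NB", "PW", "PPH", "IPH"]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def decode(logits_per_word):
    # For each word, pick the prosody label with the highest probability.
    out = []
    for logits in logits_per_word:
        probs = softmax(logits)
        out.append(LABELS[probs.index(max(probs))])
    return out

# Hypothetical model outputs for a three-word sentence.
logits = [[0.2, 0.1, 2.3, 0.4],   # PPH has the largest score
          [1.9, 0.3, 0.2, 0.1],   # NB has the largest score
          [0.1, 0.2, 0.3, 2.5]]   # IPH has the largest score
labels = decode(logits)  # -> ["PPH", "NB", "IPH"]
```

Since softmax is monotonic, the argmax could equally be taken over the raw logits; the probabilities are still useful if a confidence threshold is wanted.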
The server can then generate the target voice corresponding to the target text based on the determined prosody hierarchy structure sequence, in combination with the target voice type, target speech rate, target volume, and target sampling rate set by the user, and transmit the target voice to the terminal device so that the terminal device plays it.
For the above-described AI-based prosody hierarchy structure prediction method, the present application also provides a corresponding AI-based prosody hierarchy structure prediction device, so that the above AI-based prosody hierarchy structure prediction method can be applied and realized in practice.
Referring to Fig. 10, Fig. 10 is a structural schematic diagram of an AI-based prosody hierarchy structure prediction device 1000 corresponding to the AI-based prosody hierarchy structure prediction method shown in Fig. 2 above. The device comprises:
an obtaining module 1001, configured to obtain a target text;
a segmentation and part-of-speech tagging module 1002, configured to perform word segmentation and part-of-speech tagging on the target text to obtain a segmentation and tagging sequence;
a word-level feature extraction module 1003, configured to perform word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence, wherein the word-level feature of each word in the word-level feature sequence at least comprises a word vector obtained through semantic feature extraction;
a prosody hierarchy structure prediction module 1004, configured to obtain, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, referring to Fig. 11, Fig. 11 is a structural schematic diagram of another prosody hierarchy structure prediction device provided by the embodiments of the present application, wherein the word-level feature extraction module 1003 comprises:
a semantic feature extraction submodule 1101, configured to perform semantic feature extraction on the segmentation and tagging sequence to obtain the word vector corresponding to each word;
a position vector encoding submodule 1102, configured to encode the position information of each word in the target text to obtain the position vector corresponding to each word;
a word-level feature generation submodule 1103, configured to generate the word-level feature corresponding to each word according to the position vector corresponding to each word in the segmentation and tagging sequence, at least one of the part-of-speech vector, word-length vector, and post-word punctuation-type vector, and the word vector corresponding to each word;
a combination submodule 1104, configured to combine the word-level features corresponding to the words in the segmentation and tagging sequence to obtain the word-level feature sequence.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 11, the semantic feature extraction submodule 1101 is specifically configured to:
perform semantic feature extraction on the segmentation and tagging sequence through a semantic feature extraction model to obtain the word vector corresponding to each word, wherein the semantic feature extraction model uses a BERT network structure or a Skip-Gram network structure.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 11, the word-level feature generation submodule 1103 is specifically configured to:
for each word in the segmentation and tagging sequence, sum the word vector and the position vector corresponding to the word, and concatenate the sum with the part-of-speech vector, word-length vector, and post-word punctuation-type vector corresponding to the word, obtaining the word-level feature corresponding to the word.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, referring to Fig. 12, Fig. 12 is a structural schematic diagram of another prosody hierarchy structure prediction device provided by the embodiments of the present application. The device further comprises:
a sample acquisition module 1201, configured to obtain a training sample set, the training sample set including training samples and a prosody hierarchy structure label corresponding to each training sample;
a training module 1202, configured to perform parameter training on a deep neural network model based on the self-attention mechanism using the training sample set, and to use the trained deep neural network model based on the self-attention mechanism as the prosody hierarchy structure prediction model.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 12, referring to Fig. 13, Fig. 13 is a structural schematic diagram of another prosody hierarchy structure prediction device provided by the embodiments of the present application. The device further comprises:
a label smoothing module 1301, configured to perform label smoothing on the prosody hierarchy structure label corresponding to each training sample in the training sample set;
the training module 1202 is then specifically configured to:
perform parameter training on the deep neural network model based on the self-attention mechanism using the training sample set after label smoothing.
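Label smoothing, as applied by a module such as 1301, can be sketched as follows: each one-hot prosody label is mixed with a uniform distribution over the K classes. The smoothing factor 0.1 is an assumed value, not one specified by the embodiment.

```python
def smooth_label(one_hot, eps=0.1):
    # Mix the one-hot target with a uniform distribution over K classes:
    # y_smooth = (1 - eps) * y + eps / K
    k = len(one_hot)
    return [(1 - eps) * y + eps / k for y in one_hot]

# One-hot label for a 4-class (NB/PW/PPH/IPH) model; after smoothing,
# the true class keeps most of the mass and the rest is spread uniformly.
smoothed = smooth_label([0, 0, 1, 0], eps=0.1)
```

Training against such soft targets discourages the network from producing overconfident probabilities, which tends to regularize boundary classification.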
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, the network structure of the prosody hierarchy structure prediction model comprises a cascaded fully connected layer, N feature processing layers, and a normalization layer, N being a positive integer; each feature processing layer comprises a nonlinear sublayer and a self-attention sublayer.
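A minimal sketch of one such feature processing layer — a self-attention sublayer followed by a nonlinear sublayer — is given below, assuming single-head scaled dot-product attention with identity query/key/value projections. A real layer would add learned projection matrices, and the full model would stack N such layers between the cascaded fully connected layer and the normalization layer.

```python
import math

def self_attention(seq):
    # Scaled dot-product self-attention with identity Q/K/V projections:
    # each output vector is a probability-weighted mix over all positions,
    # so every word can attend to context across the full sentence.
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in seq]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]  # softmax over positions
        out.append([sum(w * v[i] for w, v in zip(weights, seq))
                    for i in range(d)])
    return out

def nonlinear_sublayer(seq):
    # Position-wise ReLU nonlinearity.
    return [[max(0.0, x) for x in v] for v in seq]

def feature_processing_layer(seq):
    return nonlinear_sublayer(self_attention(seq))

# Toy 3-word, 2-dim feature sequence.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out = feature_processing_layer(features)
```

Because the attention weights are computed between every pair of positions, the receptive field covers the whole sentence in a single sublayer, in contrast with the step-by-step propagation of an RNN.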
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, the prosody hierarchy structure prediction module 1004 is specifically configured to:
input the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a four-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a prosodic word boundary, a prosodic phrase boundary, an intonation phrase boundary, and a non-prosodic-structure boundary;
obtain the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence including, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, the prosody hierarchy structure prediction module 1004 is specifically configured to:
input the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a three-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a non-prosodic-structure boundary, a prosodic word boundary, and a prosodic phrase boundary;
obtain the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence including, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, the prosody hierarchy structure prediction module 1004 is specifically configured to:
input the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a two-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a prosodic phrase boundary and a non-prosodic-phrase boundary;
obtain the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence including, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
Optionally, on the basis of the prosody hierarchy structure prediction device shown in Fig. 10, referring to Fig. 14, Fig. 14 is a structural schematic diagram of another prosody hierarchy structure prediction device provided by the embodiments of the present application. The device further comprises:
a speech synthesis module 1401, configured to perform speech synthesis according to the prosody hierarchy structure sequence to obtain a target voice.
The above AI-based prosody hierarchy structure prediction device uses a deep neural network model based on the self-attention mechanism to predict the prosody hierarchy structure of each word in the target text. Through its self-attention sublayers, this deep neural network model can better capture context dependencies across the full sentence. Compared with the CRF and RNN models of the related art, it has stronger sequence-modeling ability and, correspondingly, can achieve better prosody hierarchy structure prediction, which in turn helps improve the quality of subsequent speech synthesis.
The embodiments of the present application also provide a server and a terminal device for predicting prosody hierarchy structures. The server and terminal device for predicting prosody hierarchy structures provided by the embodiments of the present application are introduced below from the perspective of hardware entities.
Referring to Fig. 15, Fig. 15 is a schematic diagram of a server architecture provided by the embodiments of the present application. The server 1500 may vary considerably in configuration or performance, and may include one or more central processing units (CPUs) 1522 (e.g., one or more processors), memory 1532, and one or more storage media 1530 (e.g., one or more mass storage devices) storing application programs 1542 or data 1544. The memory 1532 and the storage medium 1530 may provide transient or persistent storage. The programs stored on the storage medium 1530 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processing unit 1522 may be configured to communicate with the storage medium 1530 and execute, on the server 1500, the series of instruction operations stored in the storage medium 1530.
The server 1500 may also include one or more power supplies 1526, one or more wired or wireless network interfaces 1550, one or more input/output interfaces 1558, and/or one or more operating systems 1541, such as Windows Server™, Mac OS X™, Unix™, Linux™, or FreeBSD™.
The steps performed by the server in the above embodiments may be based on the server architecture shown in Fig. 15, where the CPU 1522 is configured to execute the following steps:
obtaining a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a segmentation and tagging sequence;
performing word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence, wherein the word-level feature of each word in the word-level feature sequence at least comprises a word vector obtained through semantic feature extraction;
obtaining, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
Optionally, the CPU 1522 may also execute the method steps of any specific implementation of the AI-based prosody hierarchy structure prediction method in the embodiments of the present application.
Referring to Fig. 16, Fig. 16 is a structural schematic diagram of a terminal device provided by the embodiments of the present application. For convenience of description, only the parts relevant to the embodiments of the present application are shown; for specific technical details not disclosed, please refer to the method part of the embodiments of the present application. The terminal may be any terminal device, including a computer, a tablet computer, or a personal digital assistant (PDA). Taking a mobile phone as the terminal as an example:
Fig. 16 shows a block diagram of part of the structure of a mobile phone related to the terminal provided by the embodiments of the present application. Referring to Fig. 16, the mobile phone includes components such as a radio frequency (RF) circuit 1610, a memory 1620, an input unit 1630, a display unit 1640, a sensor 1650, an audio circuit 1660, a wireless fidelity (WiFi) module 1670, a processor 1680, and a power supply 1690. Those skilled in the art will appreciate that the phone structure shown in Fig. 16 does not constitute a limitation on the mobile phone, which may include more or fewer components than illustrated, combine certain components, or have a different component arrangement.
The memory 1620 may be used to store software programs and modules, and the processor 1680 executes the various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 1620. The memory 1620 may mainly include a program storage area and a data storage area, wherein the program storage area may store the operating system and the application programs required by at least one function (such as a sound playing function or an image playing function), and the data storage area may store data created according to the use of the mobile phone (such as audio data or a phone book). In addition, the memory 1620 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or other solid-state storage devices.
The processor 1680 is the control center of the mobile phone: it connects the various parts of the entire phone through various interfaces and lines, and performs the various functions of the phone and processes data by running or executing the software programs and/or modules stored in the memory 1620 and calling the data stored in the memory 1620, thereby monitoring the phone as a whole. Optionally, the processor 1680 may include one or more processing units; preferably, the processor 1680 may integrate an application processor and a modem processor, wherein the application processor mainly handles the operating system, user interface, and application programs, and the modem processor mainly handles wireless communication. It can be understood that the modem processor may also not be integrated into the processor 1680.
In the embodiments of the present application, the processor 1680 included in the terminal also has the following functions:
obtaining a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a segmentation and tagging sequence;
performing word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence, wherein the word-level feature of each word in the word-level feature sequence at least comprises a word vector obtained through semantic feature extraction;
obtaining, through a prosody hierarchy structure prediction model, the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on the self-attention mechanism.
Optionally, the processor 1680 is also configured to execute the steps of any implementation of the AI-based prosody hierarchy structure prediction method provided by the embodiments of the present application.
The embodiments of the present application also provide a computer-readable storage medium for storing a computer program, the computer program being used to execute any implementation of the AI-based prosody hierarchy structure prediction method described in the foregoing embodiments.
The embodiments of the present application also provide a computer program product including instructions which, when run on a computer, cause the computer to execute any implementation of the AI-based prosody hierarchy structure prediction method described in the foregoing embodiments.
It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division of units is only a logical functional division, and there may be other division manners in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Moreover, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be realized in the form of hardware, or in the form of a software functional unit.
If the integrated unit is realized in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part contributing to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.
The above embodiments are only intended to illustrate the technical solution of the present application, not to limit it. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that they may still modify the technical solutions recorded in the foregoing embodiments or make equivalent replacements of some of the technical features, and such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Claims (15)
1. A text prosody hierarchy structure prediction method, characterized by comprising:
obtaining a target text;
performing word segmentation and part-of-speech tagging on the target text to obtain a segmentation and tagging sequence;
performing word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence, wherein the word-level feature of each word in the word-level feature sequence at least comprises a word vector obtained through semantic feature extraction;
obtaining, through a prosody hierarchy structure prediction model, a prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on a self-attention mechanism.
2. The text prosody hierarchy structure prediction method according to claim 1, characterized in that the performing word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence comprises:
performing semantic feature extraction on the segmentation and tagging sequence to obtain a word vector corresponding to each word;
encoding position information of each word in the target text to obtain a position vector corresponding to each word;
generating a word-level feature corresponding to each word according to the position vector corresponding to each word in the segmentation and tagging sequence, at least one of a part-of-speech vector, a word-length vector, and a post-word punctuation-type vector, and the word vector corresponding to each word;
combining the word-level features corresponding to the words in the segmentation and tagging sequence to obtain the word-level feature sequence.
3. The text prosody hierarchy structure prediction method according to claim 2, characterized in that the performing semantic feature extraction on the segmentation and tagging sequence to obtain a word vector corresponding to each word comprises:
performing semantic feature extraction on the segmentation and tagging sequence through a semantic feature extraction model to obtain the word vector corresponding to each word, wherein the semantic feature extraction model uses a BERT network structure or a Skip-Gram network structure.
4. The text prosody hierarchy structure prediction method according to claim 2, characterized in that the generating a word-level feature corresponding to each word according to the position vector corresponding to each word in the segmentation and tagging sequence, at least one of the part-of-speech vector, word-length vector, and post-word punctuation-type vector, and the word vector corresponding to each word comprises:
for each word in the segmentation and tagging sequence, summing the word vector and the position vector corresponding to the word, and concatenating the sum with the part-of-speech vector, word-length vector, and post-word punctuation-type vector corresponding to the word to obtain the word-level feature corresponding to the word.
5. The text prosody hierarchy structure prediction method according to claim 1, characterized in that the method further comprises:
obtaining a training sample set, the training sample set comprising training samples and a prosody hierarchy structure label corresponding to each training sample;
performing parameter training on a deep neural network model based on the self-attention mechanism using the training sample set, and using the trained deep neural network model based on the self-attention mechanism as the prosody hierarchy structure prediction model.
6. The text prosody hierarchy structure prediction method according to claim 5, characterized in that the method further comprises:
performing label smoothing on the prosody hierarchy structure label corresponding to each training sample in the training sample set;
the performing parameter training on the deep neural network model based on the self-attention mechanism using the training sample set then comprises: performing parameter training on the deep neural network model based on the self-attention mechanism using the training sample set after label smoothing.
7. The text prosody hierarchy structure prediction method according to any one of claims 1 to 6, characterized in that the network structure of the prosody hierarchy structure prediction model comprises a cascaded fully connected layer, N feature processing layers, and a normalization layer, N being a positive integer; each feature processing layer comprises a nonlinear sublayer and a self-attention sublayer.
8. The text prosody hierarchy structure prediction method according to any one of claims 1 to 6, characterized in that the obtaining, through a prosody hierarchy structure prediction model, a prosody hierarchy structure sequence corresponding to the word-level feature sequence comprises:
inputting the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a four-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a prosodic word boundary, a prosodic phrase boundary, an intonation phrase boundary, and a non-prosodic-structure boundary;
obtaining the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence comprising, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
9. The text prosody hierarchy structure prediction method according to any one of claims 1 to 6, characterized in that the obtaining, through a prosody hierarchy structure prediction model, a prosody hierarchy structure sequence corresponding to the word-level feature sequence comprises:
inputting the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a three-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a non-prosodic-structure boundary, a prosodic word boundary, and a prosodic phrase boundary;
obtaining the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence comprising, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
10. The text prosody hierarchy structure prediction method according to any one of claims 1 to 6, characterized in that the obtaining, through a prosody hierarchy structure prediction model, a prosody hierarchy structure sequence corresponding to the word-level feature sequence comprises:
inputting the word-level feature sequence into the prosody hierarchy structure prediction model, the prosody hierarchy structure prediction model being a two-class classification model configured to predict, for each word in the text, the probabilities that the word belongs to a prosodic phrase boundary and a non-prosodic-phrase boundary;
obtaining the prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure sequence comprising, for each word, the identifier of the prosody hierarchy structure type with the highest probability for that word.
11. The text prosody hierarchy structure prediction method according to claim 1, characterized in that the method further comprises:
performing speech synthesis according to the prosody hierarchy structure sequence to obtain a target voice.
12. A text prosody hierarchy structure prediction device, characterized by comprising:
an obtaining module, configured to obtain a target text;
a segmentation and part-of-speech tagging module, configured to perform word segmentation and part-of-speech tagging on the target text to obtain a segmentation and tagging sequence;
a word-level feature extraction module, configured to perform word-level feature extraction according to the segmentation and tagging sequence to obtain a word-level feature sequence, wherein the word-level feature of each word in the word-level feature sequence at least comprises a word vector obtained through semantic feature extraction;
a prosody hierarchy structure prediction module, configured to obtain, through a prosody hierarchy structure prediction model, a prosody hierarchy structure sequence corresponding to the word-level feature sequence, the prosody hierarchy structure prediction model being a deep neural network model based on a self-attention mechanism.
13. The text prosodic hierarchy structure prediction apparatus according to claim 12, characterized in that the word-level feature extraction module comprises:
a semantic feature extraction submodule, configured to perform semantic feature extraction on the tagged segmentation sequence to obtain the word vector corresponding to each word;
a position vector encoding submodule, configured to encode the position of each word in the target text to obtain the position vector corresponding to each word;
a word-level feature generation submodule, configured to generate the word-level feature of each word from the word vector of that word together with at least one of the corresponding position vector, part-of-speech vector, word-length vector, and post-word punctuation-type vector; and
a combination submodule, configured to combine the word-level features of the words in the tagged segmentation sequence to obtain the word-level feature sequence.
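The feature construction in claim 13 amounts to concatenating the word vector with the optional auxiliary vectors. The sketch below is hypothetical: the function name and all vector dimensions are invented for the example and do not come from the patent.

```python
import numpy as np

# Hypothetical sketch of the claimed feature generation: the word vector is
# concatenated with whichever auxiliary vectors (position, part of speech,
# word length, following-punctuation type) are provided.

def build_word_feature(word_vec, position_vec=None, pos_vec=None,
                       length_vec=None, punct_vec=None):
    parts = [word_vec] + [v for v in (position_vec, pos_vec,
                                      length_vec, punct_vec) if v is not None]
    return np.concatenate(parts)

word_vec = np.zeros(8)          # e.g. from semantic feature extraction
position_vec = np.full(4, 0.5)  # encoded position in the sentence
pos_vec = np.ones(4)            # part-of-speech embedding
feat = build_word_feature(word_vec, position_vec=position_vec, pos_vec=pos_vec)
print(feat.shape)  # → (16,)
```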
14. A text prosodic hierarchy structure prediction device, characterized in that the device comprises a processor and a memory:
the memory is configured to store a computer program; and
the processor is configured to execute the computer program to perform the text prosodic hierarchy structure prediction method according to any one of claims 1 to 11.
15. A computer-readable storage medium, characterized in that the computer-readable storage medium is configured to store a computer program, the computer program being configured to perform the text prosodic hierarchy structure prediction method according to any one of claims 1 to 11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910834143.5A CN110534087B (en) | 2019-09-04 | 2019-09-04 | Text prosody hierarchical structure prediction method, device, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110534087A true CN110534087A (en) | 2019-12-03 |
CN110534087B CN110534087B (en) | 2022-02-15 |
Family
ID=68667149
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910834143.5A Active CN110534087B (en) | 2019-09-04 | 2019-09-04 | Text prosody hierarchical structure prediction method, device, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110534087B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080147405A1 (en) * | 2006-12-13 | 2008-06-19 | Fujitsu Limited | Chinese prosodic words forming method and apparatus |
CN105185374A (en) * | 2015-09-11 | 2015-12-23 | 百度在线网络技术(北京)有限公司 | Prosodic hierarchy annotation method and device |
CN106601228A (en) * | 2016-12-09 | 2017-04-26 | 百度在线网络技术(北京)有限公司 | Sample marking method and device based on artificial intelligence prosody prediction |
CN107239444A (en) * | 2017-05-26 | 2017-10-10 | 华中科技大学 | A kind of term vector training method and system for merging part of speech and positional information |
CN107464559A (en) * | 2017-07-11 | 2017-12-12 | 中国科学院自动化研究所 | Joint forecast model construction method and system based on Chinese rhythm structure and stress |
CN109492583A (en) * | 2018-11-09 | 2019-03-19 | 安徽大学 | A kind of recognition methods again of the vehicle based on deep learning |
Non-Patent Citations (1)
Title |
---|
Wang Qi (王琦): "Research on Prosodic Structure Prediction Based on Deep Neural Networks" (基于深度神经网络的韵律结构预测研究), China Master's Theses Full-text Database, Information Science and Technology series * |
Cited By (34)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111191689B (en) * | 2019-12-16 | 2023-09-12 | 恩亿科(北京)数据科技有限公司 | Sample data processing method and device |
CN111191689A (en) * | 2019-12-16 | 2020-05-22 | 恩亿科(北京)数据科技有限公司 | Sample data processing method and device |
CN113129864A (en) * | 2019-12-31 | 2021-07-16 | 科大讯飞股份有限公司 | Voice feature prediction method, device, equipment and readable storage medium |
CN111243682A (en) * | 2020-01-10 | 2020-06-05 | 京东方科技集团股份有限公司 | Method, device, medium and apparatus for predicting toxicity of drug |
CN111259625A (en) * | 2020-01-16 | 2020-06-09 | 平安科技(深圳)有限公司 | Intention recognition method, device, equipment and computer readable storage medium |
CN111259625B (en) * | 2020-01-16 | 2023-06-27 | 平安科技(深圳)有限公司 | Intention recognition method, device, equipment and computer readable storage medium |
CN111292715B (en) * | 2020-02-03 | 2023-04-07 | 北京奇艺世纪科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium |
CN111292715A (en) * | 2020-02-03 | 2020-06-16 | 北京奇艺世纪科技有限公司 | Speech synthesis method, speech synthesis device, electronic equipment and computer-readable storage medium |
CN111259041A (en) * | 2020-02-26 | 2020-06-09 | 山东理工大学 | Scientific and technological expert resource virtualization and semantic reasoning retrieval method |
CN111339771A (en) * | 2020-03-09 | 2020-06-26 | 广州深声科技有限公司 | Text prosody prediction method based on multi-task multi-level model |
CN111339771B (en) * | 2020-03-09 | 2023-08-18 | 广州深声科技有限公司 | Text prosody prediction method based on multitasking multi-level model |
CN111524557A (en) * | 2020-04-24 | 2020-08-11 | 腾讯科技(深圳)有限公司 | Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence |
CN111524557B (en) * | 2020-04-24 | 2024-04-05 | 腾讯科技(深圳)有限公司 | Inverse synthesis prediction method, device, equipment and storage medium based on artificial intelligence |
US11769480B2 (en) | 2020-06-15 | 2023-09-26 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for training model, method and apparatus for synthesizing speech, device and storage medium |
CN111667816B (en) * | 2020-06-15 | 2024-01-23 | 北京百度网讯科技有限公司 | Model training method, speech synthesis method, device, equipment and storage medium |
CN111667816A (en) * | 2020-06-15 | 2020-09-15 | 北京百度网讯科技有限公司 | Model training method, speech synthesis method, apparatus, device and storage medium |
CN111724765A (en) * | 2020-06-30 | 2020-09-29 | 上海优扬新媒信息技术有限公司 | Method and device for converting text into voice and computer equipment |
CN111951781A (en) * | 2020-08-20 | 2020-11-17 | 天津大学 | Chinese prosody boundary prediction method based on graph-to-sequence |
CN112052673A (en) * | 2020-08-28 | 2020-12-08 | 丰图科技(深圳)有限公司 | Logistics network point identification method and device, computer equipment and storage medium |
CN112183084B (en) * | 2020-09-07 | 2024-03-15 | 北京达佳互联信息技术有限公司 | Audio and video data processing method, device and equipment |
CN112183084A (en) * | 2020-09-07 | 2021-01-05 | 北京达佳互联信息技术有限公司 | Audio and video data processing method, device and equipment |
CN112131878A (en) * | 2020-09-29 | 2020-12-25 | 腾讯科技(深圳)有限公司 | Text processing method and device and computer equipment |
WO2021189984A1 (en) * | 2020-10-22 | 2021-09-30 | 平安科技(深圳)有限公司 | Speech synthesis method and apparatus, and device and computer-readable storage medium |
CN112309368A (en) * | 2020-11-23 | 2021-02-02 | 北京有竹居网络技术有限公司 | Prosody prediction method, device, equipment and storage medium |
CN112463921A (en) * | 2020-11-25 | 2021-03-09 | 平安科技(深圳)有限公司 | Prosodic hierarchy dividing method and device, computer equipment and storage medium |
CN112463921B (en) * | 2020-11-25 | 2024-03-19 | 平安科技(深圳)有限公司 | Prosody hierarchy dividing method, prosody hierarchy dividing device, computer device and storage medium |
CN112668315A (en) * | 2020-12-23 | 2021-04-16 | 平安科技(深圳)有限公司 | Automatic text generation method, system, terminal and storage medium |
WO2022142105A1 (en) * | 2020-12-31 | 2022-07-07 | 平安科技(深圳)有限公司 | Text-to-speech conversion method and apparatus, electronic device, and storage medium |
CN113096641A (en) * | 2021-03-29 | 2021-07-09 | 北京大米科技有限公司 | Information processing method and device |
CN113901210B (en) * | 2021-09-15 | 2022-12-13 | 昆明理工大学 | Method for marking verbosity of Thai and Burma characters by using local multi-head attention to mechanism fused word-syllable pair |
CN113901210A (en) * | 2021-09-15 | 2022-01-07 | 昆明理工大学 | Method for marking verbosity of Thai and Burma characters by using local multi-head attention to mechanism fused word-syllable pair |
CN114091444A (en) * | 2021-11-15 | 2022-02-25 | 北京声智科技有限公司 | Text processing method and device, computer equipment and storage medium |
CN115116428A (en) * | 2022-05-19 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Prosodic boundary labeling method, apparatus, device, medium, and program product |
CN115116428B (en) * | 2022-05-19 | 2024-03-15 | 腾讯科技(深圳)有限公司 | Prosodic boundary labeling method, device, equipment, medium and program product |
Also Published As
Publication number | Publication date |
---|---|
CN110534087B (en) | 2022-02-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110534087A (en) | A kind of text prosody hierarchy Structure Prediction Methods, device, equipment and storage medium | |
CN110490213B (en) | Image recognition method, device and storage medium | |
CN111368993B (en) | Data processing method and related equipment | |
CN110609891A (en) | Visual dialog generation method based on context awareness graph neural network | |
CN109344404B (en) | Context-aware dual-attention natural language reasoning method | |
CN110377916B (en) | Word prediction method, word prediction device, computer equipment and storage medium | |
CN110334354A (en) | A kind of Chinese Relation abstracting method | |
CN109271493A (en) | A kind of language text processing method, device and storage medium | |
CN108846077A (en) | Semantic matching method, device, medium and the electronic equipment of question and answer text | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN108628935A (en) | A kind of answering method based on end-to-end memory network | |
CN112288075A (en) | Data processing method and related equipment | |
CN110502610A (en) | Intelligent sound endorsement method, device and medium based on text semantic similarity | |
US11645479B1 (en) | Method for AI language self-improvement agent using language modeling and tree search techniques | |
CN114676234A (en) | Model training method and related equipment | |
CN110796160A (en) | Text classification method, device and storage medium | |
CN107679225A (en) | A kind of reply generation method based on keyword | |
CN110991290A (en) | Video description method based on semantic guidance and memory mechanism | |
CN110334196B (en) | Neural network Chinese problem generation system based on strokes and self-attention mechanism | |
CN113901191A (en) | Question-answer model training method and device | |
CN109710760A (en) | Clustering method, device, medium and the electronic equipment of short text | |
Zhou et al. | ICRC-HIT: A deep learning based comment sequence labeling system for answer selection challenge | |
CN108345612A (en) | A kind of question processing method and device, a kind of device for issue handling | |
CN114882862A (en) | Voice processing method and related equipment | |
CN111767720B (en) | Title generation method, computer and readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||