CN110223671A

CN110223671A - Language rhythm boundary prediction method, apparatus, system and storage medium

Info

Publication number: CN110223671A
Application number: CN201910492657.7A
Authority: CN
Inventors: 潘华山; 李秀林
Original assignee: Standard Bay (shenzhen) Technology Co Ltd
Current assignee: Standard Bay (shenzhen) Technology Co Ltd
Priority date: 2019-06-06
Filing date: 2019-06-06
Publication date: 2019-09-10
Anticipated expiration: 2039-06-06
Also published as: CN110223671B

Abstract

Embodiments of the invention provide a language rhythm boundary prediction method, apparatus, system and storage medium. The language rhythm boundary prediction method includes: extracting an embedded feature of a text; using each of at least two component models to predict a task rhythm boundary of a corresponding granularity based on the embedded feature, wherein at least one component model predicts the task rhythm boundary of its corresponding granularity also based on a task rhythm boundary predicted by at least one other component model, and the granularity of the task rhythm boundary predicted by the at least one component model is larger than that of the task rhythm boundary predicted by the at least one other component model; and determining a final rhythm boundary at least based on the task rhythm boundaries other than the task rhythm boundary predicted by the at least one other component model. The above technical solution unifies at least two component models, each used to predict task rhythm boundaries of a different granularity, under one framework for language rhythm boundary prediction, which improves the prediction effect.

Description

Language rhythm boundary prediction method, apparatus, system and storage medium

Technical field

The present invention relates to the field of speech analysis and processing, and more specifically to a language rhythm boundary prediction method, apparatus, system and storage medium.

Background art

In recent years, with the development of speech technology, rhythm structure analysis and prediction has played an increasingly important role in the naturalness and intelligibility of speech synthesis, analysis and processing. Improving the prediction of language rhythm boundaries is therefore of great significance.

Currently, language rhythm boundary prediction is usually decomposed into tasks of different granularities, and a component model is established independently for each of these tasks. The accuracy of language rhythm boundary prediction performed with such component models leaves room for improvement.

Summary of the invention

The present invention is proposed in view of the above problem.

According to one aspect of the invention, a language rhythm boundary prediction method is provided. The method includes:

extracting an embedded feature of a text;

using each of at least two component models to predict a task rhythm boundary of a corresponding granularity based on the embedded feature, wherein at least one component model predicts the task rhythm boundary of its corresponding granularity also based on a task rhythm boundary predicted by at least one other component model, and the granularity of the task rhythm boundary predicted by the at least one component model is larger than that of the task rhythm boundary predicted by the at least one other component model; and

determining a final rhythm boundary at least based on the task rhythm boundaries other than the task rhythm boundary predicted by the at least one other component model.

Illustratively, for each component model other than the one that performs the rhythm boundary prediction task of the smallest granularity, the component model predicts the task rhythm boundary of its corresponding granularity for the text based on the embedded feature and on the task rhythm boundaries of all granularities smaller than that corresponding granularity.

Illustratively, for each of the at least one component model, using the component model to predict the task rhythm boundary of the corresponding granularity based on the embedded feature includes:

extracting a fusion feature of the corresponding granularity based on the embedded feature and the task rhythm boundary predicted by the at least one other component model; and

determining the task rhythm boundary of the corresponding granularity of the text using the component model, based on the fusion feature of the corresponding granularity.

Illustratively, extracting the fusion feature of the corresponding granularity based on the embedded feature and the task rhythm boundary predicted by the at least one other component model includes:

concatenating the embedded feature with the task rhythm boundary predicted by the at least one other component model to obtain a concatenated feature of the corresponding granularity; and

extracting the fusion feature of the corresponding granularity based on the concatenated feature of the corresponding granularity.

Illustratively, determining the final rhythm boundary at least based on the task rhythm boundary of the corresponding granularity predicted by the at least one component model includes:

merging the task rhythm boundaries of all granularities of the text to determine the final rhythm boundary of the text.

Illustratively, using each of the at least two component models to predict the task rhythm boundary of the corresponding granularity based on the embedded feature includes:

predicting a task rhythm boundary of a first granularity of the text based on the embedded feature, using a first component model;

predicting a task rhythm boundary of a second granularity of the text based on the embedded feature and the task rhythm boundary of the first granularity, using a second component model; and

predicting a task rhythm boundary of a third granularity of the text based on the embedded feature, the task rhythm boundary of the first granularity and the task rhythm boundary of the second granularity, using a third component model.

Illustratively, the first granularity is the rhythm word granularity, the second granularity is the prosodic phrase granularity, and the third granularity is the intonation phrase granularity.

Illustratively, before extracting the embedded feature of the text, the method further includes:

training the component models according to a loss function using sample data.

Illustratively, the loss function is determined based on the task rhythm boundaries of the corresponding granularities of the text predicted by each component model.
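As a rough illustration of a loss that takes every component model's prediction into account, the sketch below adds up one negative log-likelihood term per granularity. The summation, the optional weights, the per-position 0/1 boundary labels, and the names `task_nll` and `joint_loss` are all assumptions made for illustration; the text above does not specify the form of the loss function.

```python
import math

def task_nll(probs, gold):
    """Mean negative log-likelihood of gold 0/1 boundary labels.

    probs[i] is the predicted probability that a boundary follows position i.
    """
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, y in zip(probs, gold):
        p_gold = p if y == 1 else 1.0 - p
        total += -math.log(max(p_gold, eps))
    return total / len(gold)

def joint_loss(predictions, gold, weights=None):
    """Weighted sum of per-granularity losses (the weighting is an assumption)."""
    weights = weights or {name: 1.0 for name in predictions}
    return sum(weights[name] * task_nll(predictions[name], gold[name])
               for name in predictions)

# Hypothetical predictions for two granularities over three positions.
predictions = {"PW": [0.9, 0.8, 0.7], "PPH": [0.1, 0.2, 0.9]}
gold = {"PW": [1, 1, 1], "PPH": [0, 0, 1]}
loss = joint_loss(predictions, gold)
```

Because all granularities contribute to a single scalar, a gradient step on this value would update every component model together, which is one way to read "training the component models according to a loss function".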

Illustratively, the component model is a neural network component model.

Illustratively, the neural network component model includes a bidirectional long short-term memory network and a conditional random field model.

Illustratively, extracting the embedded feature of the text includes:

segmenting the text into words to obtain character-level features;

applying feature embedding to the character-level features;

concatenating all the embedded character-level features to obtain a concatenated feature; and

extracting the embedded feature of the text based on the concatenated feature.

According to another aspect of the invention, a language rhythm boundary prediction apparatus is also provided, comprising:

an extraction module, configured to extract an embedded feature of a text;

a prediction module, configured to use each of at least two component models to predict a task rhythm boundary of a corresponding granularity based on the embedded feature, wherein at least one component model predicts the task rhythm boundary of its corresponding granularity also based on a task rhythm boundary predicted by at least one other component model, and the granularity of the task rhythm boundary predicted by the at least one component model is larger than that of the task rhythm boundary predicted by the at least one other component model; and

a determining module, configured to determine a final rhythm boundary at least based on the task rhythm boundaries other than the task rhythm boundary predicted by the at least one other component model.

According to a further aspect of the invention, a language rhythm boundary prediction system is also provided, comprising a processor and a memory, wherein computer program instructions are stored in the memory and, when run by the processor, execute the language rhythm boundary prediction method described above.

According to yet another aspect of the invention, a storage medium is also provided, on which program instructions are stored, the program instructions executing the language rhythm boundary prediction method described above when run.

The technical solution according to embodiments of the invention unifies at least two component models, each used to predict task rhythm boundaries of a different granularity, under one framework for language rhythm boundary prediction, which improves the prediction effect.

The above is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

Brief description of the drawings

The above and other objects, features and advantages of the invention will become more apparent from the more detailed description of embodiments of the invention in conjunction with the accompanying drawings. The drawings are provided for further understanding of the embodiments, constitute a part of the specification, and serve together with the embodiments to explain the invention without limiting it. In the drawings, the same reference numerals generally denote the same components or steps.

Fig. 1 shows a schematic block diagram of a prior-art language rhythm boundary prediction model;

Fig. 2 shows a schematic flowchart of a language rhythm boundary prediction method according to an embodiment of the invention;

Fig. 3a shows a schematic block diagram of the task layer of a language rhythm boundary prediction model according to an embodiment of the invention;

Fig. 3b shows a schematic block diagram of the task layer of a language rhythm boundary prediction model according to another embodiment of the invention;

Fig. 3c shows a schematic block diagram of the task layer of a language rhythm boundary prediction model according to yet another embodiment of the invention;

Fig. 4 shows a schematic block diagram of the feature extraction layer of a language rhythm boundary prediction model according to an embodiment of the invention;

Fig. 5 shows a schematic block diagram of a language rhythm boundary prediction model according to an embodiment of the invention;

Fig. 6 shows a schematic block diagram of a language rhythm boundary prediction apparatus according to an embodiment of the invention;

Fig. 7 shows a schematic block diagram of a language rhythm boundary prediction system according to an embodiment of the invention.

Detailed description of the embodiments

In order to make the objects, technical solutions and advantages of the invention more apparent, example embodiments according to the invention are described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some, rather than all, of the embodiments of the invention, and it should be understood that the invention is not limited by the example embodiments described herein. All other embodiments obtained by those skilled in the art based on the embodiments described herein without creative effort shall fall within the scope of protection of the invention.

The language rhythm boundary prediction scheme described herein predicts, based on text content, the positions of the language rhythm boundaries that apply when that text content is played back as speech. The scheme can be used in front-end text processing for application scenarios such as speech synthesis and video generation. Information such as the corresponding speech pauses can be provided according to rhythm boundary positions of different granularities, so that the speech expresses the semantics correctly and sounds more natural and fluent, yielding high-quality speech output.

Rhythm is a concept of auditory perception. It is an essential means of spoken communication and helps the listener better understand the information carried by speech. Rhythm boundary prediction is closely tied to the text: to improve the naturalness of speech playback, more rhythm-related information needs to be obtained from the text, for example the rhythm boundary positions of different granularities.

Taking Chinese as an example, the rhythm boundaries of Chinese are usually divided by prosodic level. The Chinese prosodic hierarchy is generally divided into three basic units: rhythm word (Prosodic Word, PW), prosodic phrase (Prosodic Phrase, PPH) and intonation phrase (Intonational Phrase, IPH), which are arranged in order in a tree-like hierarchical structure. Each of these three basic units also represents a corresponding granularity of rhythm boundary division. An intonation phrase may contain one or more prosodic phrases, and a prosodic phrase may contain one or more rhythm words. Therefore, the intonation phrase has the largest granularity, the rhythm word has the smallest, and the prosodic phrase lies between them. That is, in ascending order of granularity, the three basic units are rhythm word, prosodic phrase and intonation phrase.

Specifically, take the text "prediction of main research rhythm structure herein" as an example. The text as a whole can serve as one intonation phrase. It can be divided by rhythm boundaries into two prosodic phrases: "main research herein" and "prediction of rhythm structure". Further, it can be divided by rhythm boundaries into 6 rhythm words: "this paper", "main", "research", "rhythm", "structure" and "prediction". Clearly, the granularity of the intonation phrase is greater than that of the prosodic phrase, which in turn is greater than that of the rhythm word.

Below, the language rhythm boundary prediction method is described taking Chinese as an example. It should be understood that this is merely exemplary and does not limit the invention. The method can also be applied to other languages, such as English, Japanese and German.

Currently, rhythm boundary prediction for Chinese is usually decomposed by granularity into three independent tasks, PW, PPH and IPH, each of which is modeled and handled separately. Fig. 1 shows a schematic block diagram of a prior-art language rhythm boundary prediction model. As shown in Fig. 1, the prior-art language rhythm boundary prediction model 100 includes three parts: a feature extraction layer 110, a task layer 120 and a result output layer 130. The feature extraction layer 110 extracts the embedded feature of the text. The task layer 120 uses multiple component models to predict task rhythm boundaries of different granularities, each based on the embedded feature extracted by the feature extraction layer 110. The result output layer 130 outputs the final rhythm boundary prediction result based on the predicted task rhythm boundaries of different granularities.

The task layer 120 may include multiple component models, with different component models used to predict task rhythm boundaries of different granularities. As shown in Fig. 1, the task layer 120 may include a first component model 121, a second component model 122, a third component model 123, and so on up to an N-th component model, where N is an integer, for example 4. It is understood that although the task layer 120 shown in Fig. 1 includes more than 3 component models, it may also include only 2 or 3. These component models each predict rhythm boundaries of a different granularity, thereby completing prediction tasks of different granularities. To distinguish them from the final rhythm boundary, the rhythm boundaries predicted by each component model are called task rhythm boundaries. Each component model completes the prediction task of its granularity independently, and there is no dependency between component models. For example, the component models of the task layer 120 may each complete one of the prediction tasks for the three granularities PW, PPH and IPH: the first component model 121 may complete the PW prediction task, the second component model 122 the PPH prediction task, and the third component model 123 the IPH prediction task.

Based on the task rhythm boundaries predicted by the multiple component models, the final rhythm boundary of the text can be output.

In the language rhythm boundary prediction model 100 described above, the component models for predicting task rhythm boundaries of different granularities are all independent of one another. Each receives the features of the text and completes its own prediction task based only on those features. This approach ignores the dependencies between the task rhythm boundaries of the different granularities, which greatly degrades language rhythm boundary prediction.

To at least partially solve the above problem, embodiments of the invention provide a language rhythm boundary prediction method. In this method, a multi-task learning architecture unifies the prediction tasks of different granularities under one framework. The input data common to the prediction tasks is given a unified representation that is shared between tasks. In addition, dependencies can be established between the component models for predicting task rhythm boundaries of different granularities. In particular, a component model for predicting task rhythm boundaries of a larger granularity completes its prediction task also based on the task rhythm boundaries predicted by a component model for a smaller granularity. Fig. 2 shows a schematic flowchart of a language rhythm boundary prediction method 200 according to an embodiment of the invention. As shown in Fig. 2, the method includes the following steps.

Step S210, extract the embedded feature of the text.

The text includes all the textual content on which language rhythm boundary prediction is to be performed. The language rhythm boundary prediction method 200 is illustrated below using the text "prediction of main research rhythm structure herein" as an example.

Embedding is a method of representing discrete variables as continuous vectors. An embedded feature is the vector representation output when an embedding method converts original discrete objects into continuous vectors. The embedded feature captures intrinsic properties of the original objects, so that the similarity of objects can be measured by their similarity in the vector space. It is understood that the extracted embedded feature of the text is better suited as input to machine learning, for example as input to a feedforward neural network (Feed forward Neural Network, FNN) and/or a multilayer feedforward neural network (Multilayer Feed forward Neural Network, MFNN).

In one example, character-level features of the text can be extracted first, for example feature information such as the Chinese character, word segmentation, part of speech, word length and distance. It is understood that the character-level features can be adjusted flexibly as needed, for example by adding or removing features. The embedded feature of the text is then extracted from these character-level features using an embedding method.
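The idea of embedding several character-level features and joining them can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the `EmbeddingTable` class, the feature names `pos` and `word_length`, their vocabularies and the vector sizes are hypothetical, and the vectors are randomly initialised here whereas a real model would learn them.

```python
import random

class EmbeddingTable:
    """Maps each discrete value of one feature to a fixed-length vector."""

    def __init__(self, vocabulary, dim, seed=0):
        rng = random.Random(seed)
        # Random initial vectors; in practice these would be learned in training.
        self.vectors = {v: [rng.uniform(-1.0, 1.0) for _ in range(dim)]
                        for v in vocabulary}

    def lookup(self, value):
        return self.vectors[value]

def embed_position(features, tables):
    """Embed every character-level feature of one position and concatenate."""
    vector = []
    for name, table in tables.items():
        vector.extend(table.lookup(features[name]))
    return vector

# Hypothetical feature set: part of speech and word length.
tables = {
    "pos": EmbeddingTable(["n", "v", "adv"], dim=3, seed=1),
    "word_length": EmbeddingTable([1, 2, 3], dim=2, seed=2),
}
features = {"pos": "v", "word_length": 2}
vector = embed_position(features, tables)
```

Adding or removing a character-level feature then amounts to adding or removing one entry in `tables`, which matches the flexibility described above.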

Step S220, use each of at least two component models to predict a task rhythm boundary of a corresponding granularity based on the embedded feature, wherein at least one component model predicts the task rhythm boundary of its corresponding granularity also based on a task rhythm boundary predicted by at least one other component model, and the granularity of the task rhythm boundary predicted by the at least one component model is larger than that of the task rhythm boundary predicted by the at least one other component model.

The component model may be any existing or future-developed model for predicting rhythm boundaries, and the application does not limit this. For example, the component model may be built on a bidirectional long short-term memory network with a conditional random field (Bidirectional Long Short-Term Memory-Conditional Random Field, BLSTM-CRF).

It is understood that there are at least 2 component models, with different component models used to predict task rhythm boundaries of different granularities. Different component models may be based on the same or different mathematical models. All component models share the embedded feature extracted in step S210.

Among all the component models, at least one component model predicts the task rhythm boundary of its corresponding granularity also based on the task rhythm boundary predicted by at least one other component model. In other words, these two component models are not independent; on the contrary, a dependency or association is established between them. Task rhythm boundary prediction is thus modeled as a whole, which avoids the drop in overall language rhythm boundary prediction performance that results from ignoring the dependencies or associations between tasks of different granularities. In short, in the technical solution of this embodiment, one part of the component models predicts the task rhythm boundary of the corresponding granularity based only on the embedded feature of the text. This part includes the component model for predicting the task rhythm boundary of the smallest granularity. The other part of the component models predicts the task rhythm boundary of the corresponding granularity based not only on the embedded feature of the text but also on task rhythm boundaries predicted by other component models, and for each component model in this other part, the granularity of the task rhythm boundary it predicts is larger than that of the task rhythm boundaries on which it is based. Generally, a task rhythm boundary of a larger granularity is necessarily also a task rhythm boundary of a smaller granularity. Taking the text "prediction of main research rhythm structure herein" again, the rhythm boundary of the larger-granularity prosodic unit lies between the two prosodic phrases "main research herein" and "prediction of rhythm structure". That position is also a rhythm boundary of the smaller-granularity rhythm word, namely the boundary between the rhythm words "research" and "rhythm". Predicting the task rhythm boundary of a larger granularity based on that of a smaller granularity can therefore improve the accuracy of the larger-granularity task rhythm boundary.

Fig. 3 a shows the task layer 320a of language rhythm Boundary Prediction model 300a according to an embodiment of the invention Schematic block diagram.Task layer 320a predicts different grains by N number of component model for the embedded feature of text based respectively The task rhythm boundary of degree.N number of component model includes first assembly model 321a, the second component model 322a, third assembly mould Type 323a ... N component model.Wherein N is the integer greater than 1.These component models share the embedded feature of text.

The granularity on the task rhythm boundary predicted according to the component model of the sequence in Fig. 3 a from left to right is gradually increased. That is the granularity on the task rhythm boundary of the second component model 322a prediction is greater than the task rhythm of first assembly model 321a prediction The granularity on boundary, and so on, it is pre- that the granularity on the task rhythm boundary of N component model prediction is greater than (N-1) component model The granularity on the task rhythm boundary of survey.

Appointing in the task layer 320a and language rhythm Boundary Prediction model 100 in language rhythm Boundary Prediction model 300a Being engaged in, there are following differences for layer 120.First assembly model 121, the second component model 122, third component model in task layer 120 123 and N component model be independent from each other.The second component model 322a, third component model in task layer 320a 323a and N component model etc. are the component models that can rely on its left side.

Specifically, in the task layer 120 of language rhythm Boundary Prediction model 100, the input of each component model is only wrapped Include the feature of the text of the output of feature extraction layer 110.

By contrast, in the task layer 320a of the language rhythm boundary prediction model 300a, the input of the second component model 322a includes not only the embedded feature of the text but may also include the task rhythm boundary predicted by the first component model 321a. As shown in Fig. 3a, in addition to the embedded feature of the text, the input of the N-th component model may also include the task rhythm boundaries predicted by all component models to its left: those of the (N-1)-th component model, the (N-2)-th component model, and so on down to the first component model 321a. It is understood that, in addition to the embedded feature of the text, the input of the N-th component model may instead include the task rhythm boundaries predicted by any one or more of the component models to its left. In this way, the second component model, the third component model, ..., and the N-th component model establish dependencies or associations with other component models.
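The Fig. 3a arrangement, in which each component model sees the shared embedded feature plus every prediction made to its left, can be sketched as a simple loop. This is a minimal sketch under stated assumptions: the functions `concat`, `predict_cascade` and `make_stub` are hypothetical, the component models are stand-in threshold functions rather than BLSTM-CRF networks, and predictions are represented as one 0/1 boundary flag per position.

```python
def concat(embedded, *boundary_sequences):
    """Append earlier boundary predictions to each position's feature vector."""
    rows = []
    for i, vec in enumerate(embedded):
        rows.append(list(vec) + [seq[i] for seq in boundary_sequences])
    return rows

def predict_cascade(embedded, component_models):
    """Run component models from smallest to largest granularity.

    Model k receives the shared embedded feature plus the task boundaries
    predicted by all models before it.
    """
    predicted = []
    for model in component_models:
        inputs = concat(embedded, *predicted)
        predicted.append(model(inputs))
    return predicted

def make_stub(threshold):
    """Hypothetical stand-in model: flags a boundary where the row sum is large."""
    return lambda rows: [1 if sum(row) >= threshold else 0 for row in rows]

embedded = [[0.2, 0.1], [0.9, 0.4], [0.6, 0.9]]  # one vector per position
models = [make_stub(0.5), make_stub(2.4), make_stub(3.0)]
pw, pph, iph = predict_cascade(embedded, models)
```

Each later model's input row is wider than the previous one's by exactly one value per earlier prediction, which is the concatenation described in the summary above.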

It is understood that although in the above example the second component model, the third component model, ..., and the N-th component model each establish a dependency or association with all the component models to their respective left, this is not required. For example, each of the second component model, the third component model, ..., and the N-th component model may have a dependency on only one or some of the component models to its left. In other words, the input of such a component model includes the embedded feature of the text and the task rhythm boundaries predicted by some of the component models to its left, rather than by all of them.

Fig. 3 b shows the task layer of language rhythm Boundary Prediction model 300b in accordance with another embodiment of the present invention The schematic block diagram of 320b.In the component model in the task layer 320b of language rhythm Boundary Prediction model 300b, not group There are dependences all between its respectively all components model on the left side for part model.Such as the input of third component model 323b The task rhythm boundary of embedded feature and the second component model 322b prediction including text, but third component model 323b Input do not include first assembly model 321b prediction task rhythm boundary.

It is understood that although in the above example the second component model, the third component model, ..., and the N-th component model each establish a dependency or association with other component models, this is not required either. For example, only some of the second component model, the third component model, ..., and the N-th component model may each have a dependency on at least one component model to its left. In other words, the input of each of those component models includes the embedded feature of the text and the task rhythm boundary predicted by at least one component model to its left.

Fig. 3 c shows the task layer of the language rhythm Boundary Prediction model 300c of further embodiment according to the present invention The schematic block diagram of 320c.The task layer 320a and language rhythm Boundary Prediction model of language rhythm Boundary Prediction model 300a The function of the realization of the task layer 320c of 300c is similar with position, and details are not described herein.Unlike, language rhythm Boundary Prediction In component model in the task layer 320c of model 300c, and not all component model is all between the component model on its left side There are dependences.Such as second component model 322c input only including text embedded feature without include first assembly The task rhythm boundary of model 321c prediction.But the input of third component model 323c is in addition to the embedded spy including text Sign further includes the task rhythm boundary of the second component model 322c prediction.

Step S230, determine the final rhythm boundary at least based on the task rhythm boundaries other than the task rhythm boundary predicted by the at least one other component model. It is understood that in this step the final rhythm boundary is determined based on the task rhythm boundary predicted by at least one component model that depends on another component model.

In one example, the final rhythm boundary may be obtained by merging the task rhythm boundaries of several different granularities. For example, taking Fig. 3c again, the final rhythm boundary may result from merging the task rhythm boundaries predicted by the first component model 321c, the third component model 323c, ..., and the N-th component model.

Optionally, the task rhythm boundaries of all granularities of the text may be merged to determine the final rhythm boundary of the text. For example, taking Fig. 3a again, the final rhythm boundary may result from merging the task rhythm boundaries of the corresponding granularities predicted by the first component model 321a, the second component model 322a, the third component model 323a, ..., and the N-th component model.

It is understood that each component model yields the task rhythm boundaries of its corresponding granularity. From a certain point of view, the task rhythm boundaries of each granularity can independently represent the rhythm boundaries of the text. For the same text, task rhythm boundaries of a large granularity occur at fewer positions, and those of a small granularity at more positions. Taking the text "prediction of main research rhythm structure herein" again, there is one boundary position at the prosodic phrase granularity, between the prosodic phrases "main research herein" and "prediction of rhythm structure". There are 5 boundary positions at the rhythm word granularity, between the successive rhythm words "this paper", "main", "research", "rhythm", "structure" and "prediction". Merging the task rhythm boundaries of all granularities can capture more rhythm boundary information, so the final rhythm boundary determined in this way is better.
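One possible way to merge the task rhythm boundaries of all granularities is to label each gap between adjacent rhythm words with the largest granularity that places a boundary there. The merge rule, the function `merge_boundaries` and the set-of-gap-indices representation are assumptions for illustration; the description does not fix a particular merging procedure. The example reuses the six rhythm words and the single prosodic phrase boundary given above.

```python
LEVELS = ["PW", "PPH", "IPH"]  # smallest to largest granularity

def merge_boundaries(task_boundaries, num_gaps):
    """Label each gap with the largest granularity that has a boundary there.

    task_boundaries maps a level name to the set of gap indices (gap i lies
    between word i and word i + 1). Gaps with no boundary are labelled None.
    """
    merged = [None] * num_gaps
    for level in LEVELS:  # larger levels are applied later and overwrite
        for gap in task_boundaries.get(level, ()):
            merged[gap] = level
    return merged

words = ["this paper", "main", "research", "rhythm", "structure", "prediction"]
task_boundaries = {
    "PW": {0, 1, 2, 3, 4},  # 5 boundaries between the 6 rhythm words
    "PPH": {2},             # between "research" and "rhythm"
}
final = merge_boundaries(task_boundaries, num_gaps=len(words) - 1)
```

In this representation the single PPH boundary coincides with one of the PW boundaries, consistent with the observation that a larger-granularity boundary is also a smaller-granularity boundary.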

Alternatively, the final result can be the task rhythm boundary of any one higher granularity, or can be determined based on the task rhythm boundary of that higher granularity, where that task rhythm boundary is predicted based not only on the embedded feature of the text but also on the task rhythm boundary of a lower granularity. For example, taking Fig. 3b again as an example, the final rhythm boundary may be the task rhythm boundary predicted by the N-th component model, where the N-th component model predicts its task rhythm boundary based on the task rhythm boundary predicted by the (N-1)-th component model of lower granularity.

It is appreciated that final rhythm boundary can be used for the application such as speech synthesis.

The above technical solution unifies at least two component models, each used to predict task rhythm boundaries of a different granularity, under one framework for language rhythm boundary prediction, which improves the prediction effect.

In one example, the above step S210 of extracting the embedded feature of the text includes the following sub-steps.

Sub-step S211: segment the text to obtain character-level features.

Various character-level features, such as the Chinese character, word segmentation, part of speech, word length and distance, can be obtained by segmenting the text. These character-level features can be adjusted flexibly as needed, for example by adding or deleting one or more of them.
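
The following sketch shows one way per-character features could be derived from an already segmented and tagged text. It is only an illustration under stated assumptions: the input format, the B/M/E/S position tags used for the word segmentation feature, and the reading of "distance" as the character's offset from the start of its word are all hypothetical, since the text does not define them.

```python
def char_level_features(segmented):
    """Derive per-character features.

    segmented: list of (word, part_of_speech) pairs, assumed to come
        from some external word segmenter and tagger.
    """
    rows = []
    for word, pos in segmented:
        n = len(word)
        for i, ch in enumerate(word):
            if n == 1:
                tag = "S"          # single-character word
            elif i == 0:
                tag = "B"          # beginning of a word
            elif i == n - 1:
                tag = "E"          # end of a word
            else:
                tag = "M"          # middle of a word
            rows.append({"char": ch, "seg": tag, "pos": pos,
                         "word_len": n, "dist": i})
    return rows


# Hypothetical segmentation result with made-up part-of-speech tags.
sample = [("研究", "v"), ("韵律", "n"), ("的", "u")]
features = char_level_features(sample)
```

Each row can then be one-hot encoded field by field before embedding.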

For ease of processing, the various character-level features can each be expressed as one-hot features, that is, using one-hot encoding. One-hot encoding uses an N-bit status register to encode N states: each state has its own independent register bit, and only one bit is valid at any time. The encoding converts a categorical variable into a form that machine learning algorithms can readily use.
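
A minimal sketch of one-hot encoding follows. The part-of-speech tag set `pos_vocab` is a made-up example and not part of the original description.

```python
def one_hot(index, num_states):
    """Return a list of num_states bits in which only `index` is set."""
    if not 0 <= index < num_states:
        raise ValueError("index out of range")
    return [1 if i == index else 0 for i in range(num_states)]


pos_vocab = ["n", "v", "d", "u"]  # hypothetical part-of-speech tag set
vec = one_hot(pos_vocab.index("v"), len(pos_vocab))
print(vec)  # [0, 1, 0, 0]
```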

Sub-step S212: apply feature embedding processing to the character-level features.

Feature embedding processing can reduce the dimensionality of the character-level features. For example, for a character dictionary, the number of commonly used Chinese characters is roughly 5000 to 10000, so the dimension of the one-hot vector of a Chinese character is also roughly 5000 to 10000. Embedding processing can convert the one-hot character-level feature into a low-dimensional feature.

Sub-step S212 is now described in detail, taking as an example the feature embedding processing of the one-hot character-level feature of Chinese characters according to an embodiment of the invention. For example, the feature embedding result of a Chinese character can be determined according to the following formula:

$$\mathrm{EMB}_{cc} = X_{1 \times N_{cc}} \times W_{N_{cc} \times D_{cc}} + B_{cc},$$

where $\mathrm{EMB}_{cc}$ is the character-level feature of the Chinese character after feature embedding processing, $X_{1 \times N_{cc}}$ is the one-hot character-level feature of the Chinese character, $N_{cc}$ is the dictionary size, $D_{cc}$ is the embedding dimension, and $W$ and $B$ are model parameters. The model parameters can be adjusted according to circumstances. For example, they are randomly initialized before model training and adjusted according to the loss function during training.
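
The formula above can be sketched in plain Python as a vector-matrix product plus a bias. The toy sizes, the random initialization range and the zero bias are illustrative assumptions only.

```python
import random


def embed(x_one_hot, W, B):
    """Compute EMB = X (1 x N) times W (N x D) plus B (length D)."""
    n, d = len(W), len(W[0])
    assert len(x_one_hot) == n and len(B) == d
    return [sum(x_one_hot[i] * W[i][j] for i in range(n)) + B[j]
            for j in range(d)]


random.seed(0)
N_cc, D_cc = 6, 3  # toy sizes; a real dictionary has thousands of entries
W = [[random.uniform(-0.1, 0.1) for _ in range(D_cc)] for _ in range(N_cc)]
B = [0.0] * D_cc   # zero bias chosen only for this illustration

x = [0, 0, 1, 0, 0, 0]  # one-hot vector for dictionary entry 2
emb = embed(x, W, B)
```

Because `x` is one-hot, the product simply selects row 2 of `W`, which is why an embedding layer is often implemented as a table lookup rather than a full matrix multiplication.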

In one example, the model implementing sub-step S212 can be a feedforward neural network.

Similarly, feature embedding results can be obtained for the other character-level features such as word segmentation, part of speech, word length and distance, which are not described again here.

Sub-step S213: concatenate all character-level features that have undergone feature embedding processing to obtain a connection feature.

The connection feature can be obtained by concatenating the feature embedding results of the Chinese character, word segmentation, part of speech, word length and distance obtained in sub-step S212. The connection feature includes the information of all these features of the text.

Sub-step S214: extract the embedded feature of the text based on the connection feature.

Based on the connection feature obtained in sub-step S213, feature extraction can be strengthened by a multilayer fully connected neural network to obtain the embedded feature of the text. Optionally, a multilayer feedforward neural network (Multilayer Feedforward Neural Network, MFNN) can be used to complete sub-step S214. The MFNN can use the tanh function as its activation function. Alternatively, other neural networks such as a convolutional neural network (Convolutional Neural Network, CNN) or a bidirectional recurrent neural network (Bidirectional Recurrent Neural Network, B-RNN) can also be used.
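
Sub-steps S213 and S214 can be sketched as a concatenation followed by a small stack of tanh layers. The layer sizes, weights and input values below are made up for illustration and do not come from the original description.

```python
import math


def concat(vectors):
    """Sub-step S213: join embedded features into one connection feature."""
    out = []
    for v in vectors:
        out.extend(v)
    return out


def tanh_layer(x, W, b):
    """One fully connected layer with tanh activation: tanh(x W + b)."""
    return [math.tanh(sum(x[i] * W[i][j] for i in range(len(x))) + b[j])
            for j in range(len(b))]


def mfnn(x, layers):
    """Sub-step S214: pass x through a stack of (W, b) tanh layers."""
    for W, b in layers:
        x = tanh_layer(x, W, b)
    return x


# Toy embedded features for one character (all values are made up).
emb_char, emb_seg, emb_pos = [0.2, -0.1], [0.5], [0.0, 0.3]
connection = concat([emb_char, emb_seg, emb_pos])      # length 5

layer1 = ([[0.1] * 4 for _ in range(5)], [0.0] * 4)    # maps 5 -> 4
layer2 = ([[0.2] * 2 for _ in range(4)], [0.1] * 2)    # maps 4 -> 2
feat_embed = mfnn(connection, [layer1, layer2])
```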

Fig. 4 shows a schematic block diagram of the feature extraction layer 410 of a language rhythm boundary prediction model 400 according to an embodiment of the invention. As shown in Fig. 4, in the feature extraction layer, the character-level features of the text are obtained first, for example the Chinese character, word segmentation, part of speech, word length and distance. These character-level features are input separately into feedforward neural networks for feature embedding processing. The character-level features that have undergone feature embedding processing are concatenated by a connector to obtain the connection feature. Finally, the MFNN extracts the embedded feature of the text based on the connection feature.

Extracting the embedded feature of the text in this way yields an input better suited to machine learning, which effectively increases the accuracy of language rhythm boundary prediction and reduces computing overhead.

Taking Fig. 3a again as an example, each of the second component model 322a, the third component model 323a, ... and the N-th component model in the task layer 320a predicts the task rhythm boundary of its corresponding granularity based on the embedded feature and the task rhythm boundaries predicted by all component models to its left. Specifically, for example, the second component model 322a predicts its task rhythm boundary based on the embedded feature and the task rhythm boundary predicted by the first component model 321a; the third component model 323a predicts its task rhythm boundary based on the embedded feature and the task rhythm boundaries predicted by the first component model 321a and the second component model 322a; and the N-th component model predicts its task rhythm boundary based on the embedded feature and the task rhythm boundaries predicted by the first component model 321a, the second component model 322a, the third component model 323a, ... and the (N-1)-th component model.

Task rhythm boundaries of different granularities depend on or are associated with one another to a certain extent. To improve the overall rhythm boundary prediction, a component model therefore predicts the task rhythm boundary of its corresponding granularity based on the embedded feature and all task rhythm boundaries of granularities smaller than that granularity. This establishes, to the greatest extent, the dependence of a component model on the component models that predict task rhythm boundaries of smaller granularities, makes full use of the information in those smaller-granularity task rhythm boundaries, and helps improve the accuracy of rhythm boundary prediction.
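
The dependency pattern of Fig. 3a described above can be written down compactly. In the sketch below, the names `FEAT_embed` and `pred_i` are placeholder labels chosen for illustration.

```python
def inputs_for_component(k):
    """Inputs of the k-th component model (k = 1 is the smallest granularity).

    Every component model receives the embedded feature; the k-th one also
    receives the predictions of component models 1 .. k-1.
    """
    return ["FEAT_embed"] + [f"pred_{i}" for i in range(1, k)]


for k in range(1, 5):
    print(k, inputs_for_component(k))
```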

Illustratively, for each component model that predicts its task rhythm boundary based on task rhythm boundaries predicted by other component models, predicting the task rhythm boundary of the corresponding granularity of the text using that component model based on the embedded feature of the text includes the following steps.

Step S221: extract the fusion feature of the corresponding granularity based on the embedded feature of the text and the task rhythm boundaries predicted by all the other component models on which the component model depends. In other words, for each component model that depends on other component models to predict its task rhythm boundary, the fusion feature of the granularity corresponding to that component model is extracted based on the embedded feature of the text and the task rhythm boundaries predicted by the other component models. Referring again to the task layer 320c shown in Fig. 3c, the third component model 323c extracts the fusion feature of its corresponding granularity based on the embedded feature of the text and the task rhythm boundary predicted by the second component model 322c.

The fusion feature of a given granularity is the feature extracted from the embedded feature and the task rhythm boundaries predicted by the component models on which the corresponding component model depends. A non-linear transformation can be used to extract the fusion feature, for example an FNN_tanh function. Illustratively, the fusion feature of a specified granularity can fuse the embedded feature with the information of all task rhythm boundaries of granularities smaller than the specified granularity, which better supports rhythm boundary prediction at the current granularity.

Illustratively, for each component model that predicts its task rhythm boundary based on task rhythm boundaries predicted by other component models, step S221 specifically includes the following steps.

First, concatenate the embedded feature and the task rhythm boundaries predicted by all the other component models on which the component model depends, to obtain the linked feature of the corresponding granularity. The linked feature associates the embedded feature of the text with all the task rhythm boundaries received by the component model, and includes the information of both.

Then, extract the fusion feature of the corresponding granularity based on the linked feature of that granularity. Because the linked feature includes the information of both the embedded feature of the text and all the task rhythm boundaries received by the component model, the fusion feature of the corresponding granularity can be extracted from the linked feature. The fusion feature can be extracted with a non-linear transformation, for example an FNN_tanh function.

By obtaining the linked feature through concatenation, this technical solution not only ensures the accuracy of the predicted task rhythm boundary, and in turn the accuracy of the final rhythm boundary, but is also easy to implement.

Step S222: determine the task rhythm boundary of the corresponding granularity of the text using the component model based on the fusion feature of the corresponding granularity.

As described above, the fusion feature of the corresponding granularity fuses the embedded feature with the information of the task rhythm boundaries predicted by all the other component models on which the component model depends. Based on this fusion feature, the component model can determine the task rhythm boundary of the corresponding granularity of the text more accurately.

Illustratively, the above component model is a neural network component model.

It will be appreciated that predicting the rhythm boundary of the text with a neural network component model makes use of the self-learning ability of the neural network, so that a more accurate rhythm boundary result can be obtained.

Illustratively, the above neural network component model includes a bidirectional long short-term memory network (BLSTM) and a conditional random field (CRF) model. The BLSTM-CRF model is an end-to-end prosody prediction framework. It can not only predict rhythm boundaries more accurately, but is also language-independent and can predict the rhythm boundaries of text in any language.
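
To give a feel for the CRF part of a BLSTM-CRF model, the sketch below implements standard Viterbi decoding over a linear chain. It is not the patented model: the BLSTM is not implemented, the emission scores stand in for its per-character outputs, and the two-tag scheme and all score values are made-up assumptions.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag sequence of a linear-chain model.

    emissions[t][j]: score of tag j at position t (stand-in for BLSTM output).
    transitions[i][j]: score of moving from tag i to tag j.
    """
    num_tags = len(transitions)
    score = list(emissions[0])
    back = []
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(num_tags):
            best_i = max(range(num_tags),
                         key=lambda i: score[i] + transitions[i][j])
            new_score.append(score[best_i] + transitions[best_i][j]
                             + emissions[t][j])
            pointers.append(best_i)
        score = new_score
        back.append(pointers)
    # Trace the best path backwards from the best final tag.
    best = max(range(num_tags), key=lambda j: score[j])
    path = [best]
    for pointers in reversed(back):
        best = pointers[best]
        path.append(best)
    path.reverse()
    return path


# Tags: 0 = no boundary after this character, 1 = boundary after it.
emissions = [[2.0, 0.0], [0.5, 1.0], [0.2, 1.5]]
transitions = [[0.0, 0.0], [0.0, -3.0]]  # discourage two boundaries in a row
tags = viterbi(emissions, transitions)
```

Choosing the best tag independently at each position would mark boundaries after both the second and third characters; the transition penalty makes the decoder keep only the stronger one, which is the kind of sequence-level constraint a CRF layer contributes.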

Alternatively, the above neural network component model can be any one of neural networks such as a convolutional neural network, a recurrent neural network (Recurrent Neural Network, RNN), a gated recurrent unit (Gated Recurrent Unit, GRU) and a long short-term memory network (Long Short-Term Memory, LSTM).

In one example, viewed from another angle, using each of the at least two component models to predict the task rhythm boundary of the corresponding granularity of the text based on the embedded feature of the text includes the following steps.

Step S221': predict the task rhythm boundary of the first granularity of the text using the first component model based on the embedded feature.

It will be appreciated that the first granularity is the smallest granularity. The first component model does not depend on other component models and predicts the task rhythm boundary of the first granularity of the text based only on the embedded feature.

Optionally, predicting the task rhythm boundary of the first granularity of the text based on the embedded feature includes using a BLSTM-CRF model to execute the prediction task of the first granularity.

The task rhythm boundary of the first granularity is determined from the embedded feature according to the following formula:

$$\mathrm{First}_{pred} = \text{BLSTM-CRF}_{first}(\mathrm{FEAT}_{embed}),$$

where $\mathrm{First}_{pred}$ denotes the task rhythm boundary of the first granularity, $\mathrm{FEAT}_{embed}$ denotes the embedded feature, and $\text{BLSTM-CRF}_{first}$ denotes the BLSTM-CRF model of the first granularity.

Step S222': predict the task rhythm boundary of the second granularity of the text using the second component model based on the embedded feature and the task rhythm boundary of the first granularity. This step may include the following sub-steps.

Sub-step 1: concatenate the embedded feature and the task rhythm boundary of the first granularity to obtain the linked feature of the second granularity.

The linked feature of the second granularity can be determined according to the following formula:

$$\mathrm{Cancat}_{second} = [\mathrm{FEAT}_{embed};\ \mathrm{First}_{pred}],$$

where $\mathrm{Cancat}_{second}$ denotes the linked feature of the second granularity, $\mathrm{First}_{pred}$ denotes the task rhythm boundary of the first granularity, and $\mathrm{FEAT}_{embed}$ denotes the embedded feature.
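
The concatenation in sub-step 1 can be sketched per character as follows. How the predicted boundary is represented is not stated in the text; encoding it as a one-hot label per character, and the helper name `concat_second`, are assumptions made for this illustration.

```python
def concat_second(feat_embed, first_pred, num_labels):
    """Build Cancat_second = [FEAT_embed ; First_pred] for every character.

    feat_embed: list of per-character embedded feature vectors.
    first_pred: list of per-character integer labels of the first granularity.
    num_labels: size of the (assumed) label set, used for one-hot encoding.
    """
    assert len(feat_embed) == len(first_pred)
    rows = []
    for vec, label in zip(feat_embed, first_pred):
        one_hot = [1.0 if i == label else 0.0 for i in range(num_labels)]
        rows.append(list(vec) + one_hot)
    return rows


feat_embed = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
first_pred = [0, 1, 0]  # 1 = rhythm-word boundary after the character
cancat_second = concat_second(feat_embed, first_pred, num_labels=2)
```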

Sub-step 2: extract the fusion feature of the second granularity based on the linked feature of the second granularity.

The fusion feature of the second granularity can be determined according to the following formula:

$$\mathrm{Second}_{in} = \mathrm{FNN}_{tanh}(\mathrm{Cancat}_{second}),$$

where $\mathrm{Second}_{in}$ denotes the fusion feature of the second granularity, $\mathrm{Cancat}_{second}$ denotes the linked feature of the second granularity, and $\mathrm{FNN}_{tanh}$ denotes a feedforward neural network with tanh as its activation function.

Sub-step 3: determine the task rhythm boundary of the second granularity of the text using the second component model based on the fusion feature of the second granularity.

Similarly to the formula for determining the task rhythm boundary of the first granularity in step S221', the task rhythm boundary of the second granularity is determined according to the following formula:

$$\mathrm{Second}_{pred} = \text{BLSTM-CRF}_{second}(\mathrm{Second}_{in}),$$

where $\mathrm{Second}_{pred}$ denotes the task rhythm boundary of the second granularity, $\mathrm{Second}_{in}$ is the fusion feature of the second granularity, and $\text{BLSTM-CRF}_{second}$ denotes the BLSTM-CRF model of the second granularity.

Step S223': predict the task rhythm boundary of the third granularity of the text using the third component model based on the embedded feature, the task rhythm boundary of the first granularity and the task rhythm boundary of the second granularity.

The prediction of the task rhythm boundary of the third granularity is similar to that of the second granularity. The specific calculation is as follows:

$$\mathrm{Cancat}_{third} = [\mathrm{FEAT}_{embed};\ \mathrm{Second}_{pred};\ \mathrm{First}_{pred}],$$

$$\mathrm{Third}_{in} = \mathrm{FNN}_{tanh}(\mathrm{Cancat}_{third}),$$

$$\mathrm{Third}_{pred} = \text{BLSTM-CRF}_{third}(\mathrm{Third}_{in}),$$

where $\mathrm{Cancat}_{third}$ is the linked feature of the third granularity, $\mathrm{Third}_{in}$ is the fusion feature of the third granularity, $\mathrm{Third}_{pred}$ is the task rhythm boundary of the third granularity, and $\text{BLSTM-CRF}_{third}$ denotes the BLSTM-CRF model of the third granularity.
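
The wiring of steps S221', S222' and S223' can be sketched with the models passed in as callables. Everything concrete below is a stand-in chosen so that the data flow can be run: the threshold "models", the identity function in place of FNN_tanh and the toy concatenation are hypothetical and are not trained BLSTM-CRF models.

```python
def predict_three_granularities(feat_embed, models, fuse, concat):
    """Run the cascade of steps S221', S222' and S223'.

    models: dict of callables "first", "second", "third"
        (stand-ins for the three BLSTM-CRF models).
    fuse: callable standing in for FNN_tanh.
    concat: callable joining the embedded feature with earlier predictions.
    """
    first_pred = models["first"](feat_embed)

    second_in = fuse(concat(feat_embed, [first_pred]))
    second_pred = models["second"](second_in)

    third_in = fuse(concat(feat_embed, [second_pred, first_pred]))
    third_pred = models["third"](third_in)
    return first_pred, second_pred, third_pred


# Toy stand-ins so the wiring can be executed.
def toy_concat(feat, preds):
    return [list(f) + [p[i] for p in preds] for i, f in enumerate(feat)]


def toy_fuse(rows):
    return rows  # identity in place of FNN_tanh


def make_threshold_model(threshold):
    # Marks a boundary (1) where the sum of a row exceeds the threshold.
    return lambda rows: [1 if sum(r) > threshold else 0 for r in rows]


models = {"first": make_threshold_model(0.5),
          "second": make_threshold_model(1.8),
          "third": make_threshold_model(2.5)}
feat = [[0.2], [0.9], [0.6]]
first, second, third = predict_three_granularities(
    feat, models, toy_fuse, toy_concat)
```

The point of the sketch is the order of the calls: each larger granularity sees the embedded feature together with every smaller-granularity prediction made before it.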

Illustratively, the first granularity is the rhythm word granularity, the second granularity is the prosodic phrase granularity, and the third granularity is the intonation phrase granularity.

It will be appreciated that performing rhythm boundary prediction at the three granularities of rhythm word, prosodic phrase and intonation phrase divides the rhythm boundaries of the text reasonably and meets the needs of speech synthesis.

To illustrate the invention more clearly, Fig. 5 shows a schematic block diagram of a language rhythm boundary prediction model according to another embodiment of the invention. As shown in Fig. 5, the language rhythm boundary prediction model includes three parts: a feature extraction layer 510, a task layer 520 and a result output layer 530. The function, position and structure of the feature extraction layer 510 are similar to those of the feature extraction layer 410 in the language rhythm boundary prediction model 400 above and are not described again here.

Task layer 520 includes first assembly model 521, the second component model 522 and third component model 523.

The first component model 521 is used to execute the above step S221', predicting the task rhythm boundary of the first granularity based on the embedded feature of the text. The first granularity can be the rhythm word granularity, which is the smallest granularity.

The second component model 522 is used to execute the above step S222', predicting the task rhythm boundary of the second granularity based on the embedded feature of the text and the task rhythm boundary of the first granularity predicted by the first component model 521. The second granularity is larger than the first granularity. The second granularity can be the prosodic phrase granularity.

The third component model 523 is used to execute the above step S223', predicting the task rhythm boundary of the third granularity of the text based on the embedded feature, the task rhythm boundary of the first granularity and the task rhythm boundary of the second granularity. The third granularity is larger than both the first granularity and the second granularity. The third granularity can be the intonation phrase granularity.

The result output layer 530 is used to merge the task rhythm boundaries predicted by the three component models, namely the first component model 521, the second component model 522 and the third component model 523, to output the final rhythm boundary of the text.

The above technical solution predicts the final rhythm boundary of the text based on the language rhythm boundary prediction model 500 and can obtain more accurate prediction results. In addition, the technical solution is well suited to the prediction of Chinese text.

According to another aspect of the invention, a device for language rhythm boundary prediction is also provided. Fig. 6 shows a schematic block diagram of a device for language rhythm boundary prediction according to an embodiment of the invention.

As shown in Fig. 6, the device 600 includes an extraction module 610, a prediction module 620 and a determining module 630.

The modules can respectively execute the steps and functions of the language rhythm boundary prediction method described above. Only the main functions of the components of the device 600 are described below, and details already described above are omitted.

The extraction module 610 is used to extract the embedded feature of the text.

The prediction module 620 is used to predict, using each of at least two component models respectively, the task rhythm boundary of the corresponding granularity based on the embedded feature extracted by the extraction module 610, wherein at least one component model predicts the task rhythm boundary of its corresponding granularity also based on the task rhythm boundary predicted by at least one other component model, and the granularity of the task rhythm boundary predicted by the at least one component model is larger than that of the task rhythm boundary predicted by the at least one other component model.

The determining module 630 is used to determine the final rhythm boundary based at least on the task rhythm boundaries other than the task rhythm boundary predicted by the at least one other component model.
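
The division of the device 600 into three modules can be pictured as a simple composition. The class name `BoundaryPredictionDevice` and the placeholder lambdas below are hypothetical; they only show how data would pass from module 610 through module 620 to module 630.

```python
class BoundaryPredictionDevice:
    """Sketch of device 600 with its three modules as plain callables."""

    def __init__(self, extraction_module, prediction_module,
                 determining_module):
        self.extraction_module = extraction_module    # module 610
        self.prediction_module = prediction_module    # module 620
        self.determining_module = determining_module  # module 630

    def run(self, text):
        feat_embed = self.extraction_module(text)
        task_boundaries = self.prediction_module(feat_embed)
        return self.determining_module(task_boundaries)


# Placeholder modules, used only to show the data flow.
device = BoundaryPredictionDevice(
    extraction_module=lambda text: [[float(len(text))]] * len(text),
    prediction_module=lambda feat: {"first": [0] * len(feat)},
    determining_module=lambda preds: preds["first"],
)
result = device.run("abc")
```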

Fig. 7 shows the schematic frame according to an embodiment of the invention for language rhythm Boundary Prediction system 700 Figure.As shown in fig. 7, system 700 includes input unit 710, storage device 720, processor 730 and output device 740.

The input unit 710 is used to receive the operational order that user is inputted and acquisition data.Input unit 710 can To include one or more of keyboard, mouse, microphone, touch screen and image collecting device etc..

The storage device 720 stores computer program instructions for implementing the corresponding steps of the language rhythm boundary prediction method according to embodiments of the invention.

The processor 730 is used to run the computer program instructions stored in the storage device 720 to execute the corresponding steps of the language rhythm boundary prediction method according to embodiments of the invention, and to implement the extraction module 610, the prediction module 620 and the determining module 630 of the language rhythm boundary prediction device according to embodiments of the invention.

The output device 740 is used to output various information (such as images and/or sound) to the outside (for example, to a user), and may include one or more of a display, a loudspeaker and the like.

In one embodiment, when the computer program instructions are run by the processor 730, the system 700 is caused to execute the following steps:

Extract the embedded feature of text；

Predict, using each of at least two component models respectively, the task rhythm boundary of the corresponding granularity based on the embedded feature, wherein at least one component model predicts the task rhythm boundary of its corresponding granularity also based on the task rhythm boundary predicted by at least one other component model, and the granularity of the task rhythm boundary predicted by the at least one component model is larger than that of the task rhythm boundary predicted by the at least one other component model;

Determine the final rhythm boundary based at least on the task rhythm boundaries other than the task rhythm boundary predicted by the at least one other component model.

In addition, according to another aspect of the invention, a storage medium is also provided. Program instructions are stored on the storage medium. When the program instructions are run by a computer or processor, they cause the computer or processor to execute the corresponding steps of the language rhythm boundary prediction method of the embodiments of the invention, and to implement the corresponding modules of the language rhythm boundary prediction device or of the language rhythm boundary prediction system according to embodiments of the invention. The storage medium may include, for example, a memory card of a smartphone, a storage unit of a tablet computer, a hard disk of a personal computer, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, or any combination of the above storage media. The computer-readable storage medium can be any combination of one or more computer-readable storage media.

In one embodiment, when the computer program instructions are run by a computer or processor, the computer or processor is caused to execute the following steps:

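Extract the embedded feature of the text;

Predict, using each of at least two component models respectively, the task rhythm boundary of the corresponding granularity based on the embedded feature, wherein at least one component model predicts the task rhythm boundary of its corresponding granularity also based on the task rhythm boundary predicted by at least one other component model, and the granularity of the task rhythm boundary predicted by the at least one component model is larger than that of the task rhythm boundary predicted by the at least one other component model;

Determine the final rhythm boundary based at least on the task rhythm boundaries other than the task rhythm boundary predicted by the at least one other component model.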

The above language rhythm boundary prediction scheme unifies at least two component models, each used to predict task rhythm boundaries of a different granularity, under one framework for language rhythm boundary prediction, which improves the prediction effect.

Although example embodiments have been described here with reference to the accompanying drawings, it should be understood that the above example embodiments are merely exemplary and are not intended to limit the scope of the invention thereto. Those of ordinary skill in the art can make various changes and modifications therein without departing from the scope and spirit of the invention. All such changes and modifications are intended to be included within the scope of the invention as claimed in the appended claims.

Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the invention.

In the several embodiments provided in this application, it should be understood that the disclosed devices and methods can be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components can be combined or integrated into another device, or some features can be ignored or not executed.

In the description provided here, numerous specific details are set forth. It should be understood, however, that embodiments of the invention can be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.

Similarly, it should be understood that, in order to streamline the invention and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together into a single embodiment, figure, or description thereof in the description of exemplary embodiments of the invention. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the corresponding claims reflect, the inventive point is that the corresponding technical problem can be solved with fewer than all features of a single disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will understand that, except where features are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed can be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract and drawings) can be replaced by an alternative feature serving the same, equivalent or similar purpose.

In addition, those skilled in the art will understand that, although some embodiments described herein include certain features included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the claims, any one of the claimed embodiments can be used in any combination.

The various component embodiments of the invention can be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) can be used in practice to implement some or all functions of some modules in the language rhythm boundary prediction device according to embodiments of the invention. The invention can also be implemented as a device program (for example, a computer program and a computer program product) for executing part or all of the method described herein. Such a program implementing the invention can be stored on a computer-readable medium, or can take the form of one or more signals. Such a signal can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words can be interpreted as names.

The above is merely a specific embodiment of the invention or a description of a specific embodiment, and the protection scope of the invention is not limited thereto. Any change or replacement that a person skilled in the art can readily conceive within the technical scope disclosed by the invention shall be covered by the protection scope of the invention. The protection scope of the invention shall be subject to the protection scope of the claims.

Claims

1. A language rhythm boundary prediction method, comprising:

extracting an embedded feature of a text;

predicting, using each of at least two component models respectively, a task rhythm boundary of a corresponding granularity based on the embedded feature, wherein at least one component model predicts the task rhythm boundary of its corresponding granularity also based on a task rhythm boundary predicted by at least one other component model, and the granularity of the task rhythm boundary predicted by the at least one component model is larger than that of the task rhythm boundary predicted by the at least one other component model; and

2. the method for claim 1, wherein in addition to the rhythm Boundary Prediction task for realizing minimum particle size Each component model outside component model, the component model predict that the task rhythm boundary of the correspondence granularity of the text is to be based on The embedded feature and all task rhythm boundaries than the correspondence granularity smaller particle size.

3. method according to claim 1 or 2, wherein for each of at least one described component model part model, Predict that the task rhythm boundary for corresponding to granularity includes: based on the embedded feature using the component model

This is right for the task rhythm Boundary Extraction predicted based on the embedded feature and at least one other component model Answer the fusion feature of granularity；

Based on the fusion feature of the correspondence granularity, the task rhythm side of the correspondence granularity of the text is determined using the component model Boundary.

4. The method of claim 3, wherein extracting the fusion feature of the corresponding granularity based on the embedded feature and the task rhythm boundary predicted by the at least one other component model comprises:

concatenating the embedded feature with the task rhythm boundary predicted by the at least one other component model to obtain a linked feature of the corresponding granularity;

5. The method of claim 1 or 2, wherein determining the final rhythm boundary at least based on the task rhythm boundary of the corresponding granularity predicted by the at least one component model comprises:

merging the task rhythm boundaries of all granularities of the text to determine the final rhythm boundary of the text.
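By way of illustration only, one possible merging rule is sketched below. The claim does not prescribe how the boundaries are merged; this sketch assumes that each granularity supplies one 0/1 flag per token and that the final label of a token is the rank of the largest granularity that marks a boundary there (0 meaning no boundary). The function name `merge_boundaries` and the label encoding are assumptions for this example.

```python
from typing import List, Sequence


def merge_boundaries(levels: Sequence[Sequence[int]]) -> List[int]:
    """Merge per-granularity 0/1 boundary flags into one label per token.

    `levels` is ordered from the smallest to the largest granularity.
    The label is 0 for no boundary, otherwise the 1-based rank of the
    largest granularity that marks a boundary at that token.
    """
    if not levels:
        return []
    n = len(levels[0])
    if any(len(level) != n for level in levels):
        raise ValueError("all granularities must cover the same tokens")
    final = []
    for i in range(n):
        label = 0
        for rank, level in enumerate(levels, start=1):
            if level[i]:
                label = rank  # a larger granularity overrides a smaller one
        final.append(label)
    return final


# Made-up flags for four tokens at three granularities.
final_labels = merge_boundaries([
    [0, 1, 1, 1],  # smallest granularity
    [0, 0, 1, 1],  # middle granularity
    [0, 0, 0, 1],  # largest granularity
])
```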

6. The method of claim 1 or 2, wherein predicting, by each of the at least two component models, the task rhythm boundary of the corresponding granularity based on the embedded feature comprises:

predicting, using a first component model, a task rhythm boundary of a first granularity of the text based on the embedded feature;

predicting, using a second component model, a task rhythm boundary of a second granularity of the text based on the embedded feature and the task rhythm boundary of the first granularity; and

predicting, using a third component model, a task rhythm boundary of a third granularity of the text based on the embedded feature, the task rhythm boundary of the first granularity, and the task rhythm boundary of the second granularity.

7. The method of claim 6, wherein the first granularity is a rhythm word granularity, the second granularity is a rhythm phrase granularity, and the third granularity is an intonation phrase granularity.
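By way of illustration only, the three-level arrangement of claims 6 and 7 might be sketched as a cascade in which each component model receives the embedded feature together with every boundary sequence predicted so far. The function `run_cascade` and the threshold-based stand-in models created by `make_threshold_model` are hypothetical and are not the trained models of the specification; the feature values are made up.

```python
from typing import Callable, List, Sequence

Boundaries = List[int]
ComponentModel = Callable[[Sequence[Sequence[float]], List[Boundaries]], Boundaries]


def run_cascade(embedded: Sequence[Sequence[float]],
                models: Sequence[ComponentModel]) -> List[Boundaries]:
    """Run component models from the smallest to the largest granularity.
    Each model sees the embedded feature and all smaller-granularity results."""
    predicted: List[Boundaries] = []
    for model in models:
        predicted.append(model(embedded, list(predicted)))
    return predicted


def make_threshold_model(threshold: float) -> ComponentModel:
    """Hypothetical stand-in for a trained component model: marks a boundary
    where the first embedded dimension exceeds the threshold and every
    smaller-granularity model also marked a boundary at that token."""
    def model(embedded, lower):
        return [
            1 if emb[0] > threshold and all(level[i] for level in lower) else 0
            for i, emb in enumerate(embedded)
        ]
    return model


# One-dimensional embedded features for four tokens.
embedded = [[0.1], [0.4], [0.7], [0.95]]
word, phrase, intonation = run_cascade(
    embedded,
    [make_threshold_model(0.3),   # first granularity: rhythm word
     make_threshold_model(0.6),   # second granularity: rhythm phrase
     make_threshold_model(0.9)],  # third granularity: intonation phrase
)
```

The ordering of the list passed to `run_cascade` carries the granularity hierarchy, so adding or removing a level only changes that list.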

8. A language rhythm boundary prediction device, comprising:

an extraction module, configured to extract an embedded feature of a text;

a prediction module, configured to predict, by each of at least two component models, a task rhythm boundary of a corresponding granularity based on the embedded feature, wherein at least one component model predicts the task rhythm boundary of its corresponding granularity further based on the task rhythm boundary predicted by at least one other component model, and the granularity of the task rhythm boundary predicted by the at least one component model is larger than that of the task rhythm boundary predicted by the at least one other component model; and

a determining module, configured to determine a final rhythm boundary at least based on the task rhythm boundaries other than the task rhythm boundary predicted by the at least one other component model.

9. A language rhythm boundary prediction system, comprising a processor and a memory, wherein computer program instructions are stored in the memory, characterized in that the computer program instructions, when run by the processor, are used to execute the language rhythm boundary prediction method of any one of claims 1 to 7.

10. A storage medium on which program instructions are stored, characterized in that the program instructions, when run, are used to execute the language rhythm boundary prediction method of any one of claims 1 to 7.