CN106683667A - Automatic rhythm extracting method, system and application thereof in natural language processing - Google Patents
- Publication number: CN106683667A (application CN201710023633.8A)
- Authority: CN (China)
- Prior art keywords: rhythm, text, data, sentence, automatic
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/04 — Segmentation; word boundary detection
- G10L15/063 — Training of speech recognition systems
- G10L15/142 — Hidden Markov Models [HMMs]
- G10L15/148 — Duration modelling in HMMs, e.g. semi-HMM, segmental models or transition probabilities
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/1807 — Speech classification or search using natural language modelling using prosody or stress
- G10L15/1815 — Semantic context, e.g. disambiguation of recognition hypotheses based on word meaning
- G10L25/30 — Speech or voice analysis techniques characterised by the use of neural networks
Abstract
The invention relates to an automatic prosody (rhythm) extraction method and system and their application in natural language processing. The method applies automatic text-to-speech alignment technology to generate a large-scale prosody data set, models the prosody of sentences with a recurrent neural network to which a bidirectional extension mechanism is added, and applies the automatically constructed text prosody data to natural language processing tasks based on recurrent neural networks. The method makes full use of the isomorphism between text prosody sequences and the sequence data common in natural language processing tasks: through alternating training under multi-task learning, natural language processing tasks are improved without the assistance of manually and explicitly annotated semantic information. In practice the method overcomes the low efficiency, inconsistent standards, and unsuitability for large-scale application of manual prosody annotation, and at the same time transfers the semantics and pragmatics present in massive speech data to other tasks.
Description
Technical field
The present invention relates to a speech prosody extraction method, and more particularly to an automatic prosody extraction method and system and their application in natural language processing tasks.
Background technology
Prosody in speech can reflect the speaker's intent by giving different words in a sentence different degrees of prominence; prosodic prominence is therefore considered indicative for understanding the semantics and pragmatics of speech. The prosody of speech mainly includes information such as liaison, sense-group pauses, stress, and rising and falling tones. Besides speech, text is another form in which semantics and pragmatics can be expressed, and the prosodic features it contains can be understood and learned by different readers; that is, text itself contains prosodic features, and this characteristic can be learned and predicted. The prosody contained in text can in turn provide other natural language processing tasks with semantic and pragmatic guidance and thereby improve their performance. The prosody implicit in text data cannot be observed directly, so it can only be obtained from speech data, labeling the prosody of the corresponding text; only then can an algorithm learn to perceive and predict prosody from plain text and so provide other natural language processing tasks with supervised guidance beyond syntactic information.

Most current natural language processing frameworks take the word and its representation (word vector) as the basic unit, whereas prosodic features in speech appear as continuous feature sequences in which speech has no obvious word segmentation points. In addition, large-scale, high-quality corpora of accurate word-level prosody cannot be obtained from speech recognition technology for training. As a result, most current methods for extracting and using speech prosodic features require people with expert knowledge to manually segment speech fragments, align speech with text, and annotate word prosodic features, making the generation of supervised data inefficient.
The prior art includes the following related literature:
1) Brenier, J. M.; Cer, D. M.; and Jurafsky, D. 2005. The detection of emphatic words using acoustic and lexical features. In INTERSPEECH, 3297-3300.
2) Brenier, J. M. 2008. The Automatic Prediction of Prosodic Prominence from Text. ProQuest.
These documents provide methods for predicting prosody from plain text, together with corresponding evaluation metrics. The ToBI toolset was used to manually segment speech and annotate it and its corresponding text with prosodic prominence, and a text prosody data set was generated by judging whether each word is prominent according to its acoustic features, such as pronunciation duration, intensity of phonation, and the minima and maxima of the fundamental frequency. A maximum entropy classifier was also used to learn and predict the prosody of the text; using text features alone, the classifier reaches a prediction accuracy of about 79%. These documents did not apply the generated prosody data set to assist other natural language processing tasks.
A further related document:
3) Hovy, D.; Anumanchipalli, G. K.; Parlikar, A.; Vaughn, C.; Lammert, A.; Hovy, E.; and Black, A. W. 2013. Analysis and Modeling of "Focus" in Context. In INTERSPEECH, 402-406.
This document provides a method for predicting prosody from context using plain text. Building on the related work above, it uses context to facilitate text prosody prediction, and uses crowdsourcing to carry out manual prosody data set annotation at a certain scale.
In all three of the related documents named above, word prosody attributes are, without exception, annotated manually, and the speech must be segmented and aligned with the text before annotation. This limits the efficiency of data set generation, so that these methods cannot obtain a large amount of labeled data in a short time; the methods mentioned in the documents above therefore lack practicality and cannot be applied in actual production. Meanwhile, the sample sizes of the data sets produced by the above methods are insufficient to cover the whole problem space of prosody prediction, so the algorithms do not scale well and underperform in application.

Therefore, no method has been found in the prior art that can automatically extract the prosodic features corresponding to words from speech; all of it is done by manual extraction. Nor has any record or practical application been found in the existing literature of using the prosodic features of the text corresponding to speech to assist natural language processing tasks. In this specific category, the present invention provides the first feasible method.
The content of the invention
The present invention is intended to solve at least one of the technical problems present in the prior art.

To this end, it is an object of the present invention to propose an efficient automatic prosody extraction method and its application in natural language processing tasks. This method can overcome the defects of traditional manual annotation - inefficiency, inconsistent standards, and unsuitability for large-scale application - while transferring the semantic and pragmatic characteristics present in large amounts of speech data to other tasks. As a data generation method that is unsupervised with respect to annotation, the present invention can effectively exploit the prosodic patterns in speech to improve the performance of other natural language processing tasks.
To achieve the above object, the invention provides an automatic speech prosody extraction and labeling method comprising the following steps:

Step 1: receive the speech data to be labeled and obtain the text corresponding to the speech data;
Step 2: align the collected speech data and the corresponding text on the time axis using text-to-speech alignment technology, forming an aligned text;
Step 3: segment the aligned text into sentences, so as to generate samples in units of sentences;
Step 4: apply an automatic prosodic prominence labeling algorithm to each sentence in the samples, so as to construct and obtain an automatically labeled text prosody data set, wherein the prosodic prominence labeling of a sentence (or the prosodic labeling of a sentence) refers to the numeric sequence corresponding to the sentence, whose values reflect the degree of prosodic prominence of the different parts (or basic units) of the sentence.
More specifically, aligning the speech data and its corresponding text on the time axis in Step 2 means making each basic unit of the text correspond to a time interval of the speech data, so as to obtain the speech data fragment corresponding to each basic unit of the text, where a basic unit refers to a character or word of Chinese, or a word of English.
More specifically, Step 4 also includes: if the original speech data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers need to be normalized separately, and the prosodic features of the speech data discretized.
According to a further aspect of the invention, an application of the automatic prosody extraction method in natural language processing tasks is also provided. The method includes: treating the prosody of text data as a sequence labeling task and using a long short-term memory network (LSTM) to model the prosody time series; the input of the LSTM model is the word vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosodic prominence label of the basic unit at the current position.
More specifically, the LSTM model can be extended to bidirectional LSTM networks, multi-layer bidirectional LSTM networks, or recurrent neural networks over time and their derived types and structures.
More specifically, the method also includes: using the text prosody data set for a sentence compression task based on a recurrent neural network (RNN). Text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, with alternating training under multi-task learning: in each period, part of the text prosody data or the sentence compression data is input to the model, the other task is input in the next period, and the two tasks alternate until the model converges.
More specifically, the method also includes: using the text prosody data set to assist natural language processing tasks based on recurrent neural networks and their related extended and improved structures. Text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, with alternating training under multi-task learning: in each period, part of the text prosody data or the sentence compression data is input to the model, the other task is input in the next period, and the two tasks alternate, optimizing the model parameters until the model converges.
According to a further aspect of the invention, an automatic speech prosody extraction and labeling system is also provided. The system includes:

an acquisition module, which receives the speech data to be labeled and obtains the text corresponding to the speech data;
an alignment module, which aligns the collected speech data and its text on the time axis using text-to-speech alignment technology, forming an aligned text;
a segmentation module, which segments the aligned text into sentences, generating samples in units of sentences;
an automatic prosody labeling module, which applies an automatic prosodic prominence labeling algorithm to each sentence in the samples, so as to construct and obtain an automatically labeled text prosody data set, wherein the prosodic prominence labeling of a sentence (or the prosodic labeling of a sentence) refers to the numeric sequence corresponding to the sentence, whose values reflect the degree of prosodic prominence of the different parts (or basic units) of the sentence.
More specifically, aligning the speech data and its corresponding text on the time axis in the alignment module means making each basic unit of the text correspond to a time interval of the speech data, so as to obtain the speech data fragment corresponding to each basic unit of the text, where a basic unit refers to a character or word of Chinese, or a word of English.
More specifically, the segmentation module is also used to: if the original speech data contains multiple readers or multiple different reading environments, normalize the pronunciation habits of the different readers separately and, as needed, discretize the prosodic features.
The present invention has the following beneficial technical effects:

1) Automatic text-to-speech alignment technology is used to generate a large-scale prosody data set. The aligned speech fragments serve as prosody indices; with the quality of the prosodic prominence annotation controlled to a certain strength, a text prosody data set with weakly supervised characteristics can be constructed. Compared with traditional manual annotation, besides being more efficient, this is also significantly more extensible: prior knowledge can be added at any time to adjust the actual annotation results and performance of the data set. Processing is fast and low-cost, and an enormous amount of data is constructed while saving substantial human resources (the amount of data generated in the same time exceeds traditional methods by more than two orders of magnitude).

2) The present invention uses a recurrent neural network to model the prosody of a sentence. With a bidirectional extension mechanism added, the recurrent neural network can effectively take the context of a word into account, and the prediction accuracy for word prosodic prominence labels can exceed 90%, significantly better than the traditional maximum entropy method. At the same time, no expert knowledge is needed for feature extraction; feature engineering is reduced, and the process conforms better to human cognition.

3) The present invention applies the automatically constructed text prosody data set to natural language processing tasks based on recurrent neural networks. The method makes full use of the isomorphism between text prosody sequences and the sequence data common in natural language processing tasks; through alternating training under multi-task learning, natural language processing tasks are improved without the assistance of explicitly annotated semantic information. In the example of the sentence compression task, the method of the present invention achieves a significant performance improvement over the prior art (more than 10%).

Additional aspects and advantages of the present invention will be given in the following description; some will become apparent from the description, or be learned through practice of the invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments with reference to the accompanying drawings, in which:

Fig. 1 shows the flow chart of an automatic speech prosody extraction and labeling method according to the invention;
Fig. 2 shows the processing mode of the multi-task LSTM model of the invention;
Fig. 3 shows the processing mode of the multi-task bidirectional LSTM model of the invention;
Fig. 4 shows the system block diagram of an automatic speech prosody extraction and labeling system according to the invention.
Specific embodiments
In order that the above objects, features, and advantages of the present invention can be understood more clearly, the present invention is described in further detail below in conjunction with the accompanying drawings and specific embodiments. It should be noted that, where there is no conflict, the features of the embodiments and examples of this application may be combined with one another.

Many details are set forth in the following description in order to fully understand the present invention; however, the present invention may also be implemented in ways different from those described here, and the protection scope of the present invention is therefore not limited by the specific embodiments disclosed below.
Fig. 1 shows the flow chart of an automatic speech prosody extraction and labeling method according to the invention.

As shown in Fig. 1, the method includes the following steps:
Step 1: receive the speech data to be labeled and obtain the text corresponding to the speech data.
Step 2: align the collected speech data and the corresponding text on the time axis using text-to-speech alignment technology, forming an aligned text.
Specifically, each basic unit of the text can be made to correspond to a time interval of the speech data, so as to obtain the speech data fragment corresponding to each basic unit of the text. A basic unit refers to a character or word of Chinese, or a word of English.

In addition, text-to-speech alignment technology includes, but is not limited to, obtaining the time at which the pronunciation of each basic unit in the speech data begins and the time at which it ends, so as to obtain the time interval occupied by each basic unit in the speech data and the time period between basic units.
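As a sketch of what this alignment step yields, the following illustrative Python maps each basic unit to its slice of the waveform. The tuple format `(unit, start_sec, end_sec)`, the sample rate, and the toy waveform are assumptions for illustration, not an interface fixed by the patent:

```python
# Sketch: turn per-unit time intervals from a text-to-speech (forced)
# aligner into speech data fragments, one per basic unit.

def fragments_from_alignment(samples, sample_rate, alignment):
    """Map each basic unit (character/word) to its slice of the waveform."""
    fragments = {}
    for unit, start_sec, end_sec in alignment:
        lo = round(start_sec * sample_rate)
        hi = round(end_sec * sample_rate)
        fragments[unit] = samples[lo:hi]
    return fragments

# Toy "waveform": 1 second of audio at 100 Hz, for brevity.
samples = list(range(100))
alignment = [("the", 0.00, 0.20), ("cat", 0.25, 0.60), ("sat", 0.65, 0.95)]
frags = fragments_from_alignment(samples, 100, alignment)
```

The gaps between intervals (0.20-0.25 s, 0.60-0.65 s) correspond to the inter-unit time periods the text mentions.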
Step 3: segment the aligned text into sentences, generating samples in units of sentences.
For example, sentence segmentation of the aligned text can be, but is not limited to being, performed according to the punctuation of the sentences, so that each sentence is composed of basic units each accompanied by its corresponding speech data fragment.
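The punctuation-based segmentation just described can be sketched as follows; the token/interval representation and the punctuation set are illustrative assumptions:

```python
# Sketch: split an aligned token stream into sentence samples at
# punctuation marks, keeping each token paired with its time interval.

SENTENCE_END = {".", "!", "?", "。", "！", "？"}

def split_sentences(aligned_tokens):
    """aligned_tokens: list of (token, (start, end)) pairs; punctuation
    tokens carry a zero-length interval here for simplicity."""
    sentences, current = [], []
    for token, interval in aligned_tokens:
        if token in SENTENCE_END:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append((token, interval))
    if current:
        sentences.append(current)
    return sentences

stream = [("hello", (0.0, 0.4)), ("world", (0.5, 0.9)), (".", (0.9, 0.9)),
          ("again", (1.0, 1.4)), ("?", (1.4, 1.4))]
sents = split_sentences(stream)
```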
Step 4: apply an automatic prosodic prominence labeling algorithm to each sentence after segmentation, so as to construct and obtain an automatically labeled text prosody data set.
Specifically, this step also includes: if the original speech data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers need to be normalized separately to eliminate their influence, and the prosodic features of the speech data discretized as needed. Here, prosodic features refer to the pronunciation duration of a basic unit, the intensity of phonation, and the maxima and minima of the fundamental frequency.
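One plausible reading of this normalization and discretization step is sketched below: each prosodic feature (here, per-unit duration) is z-scored within a speaker to cancel personal reading habits, then binned into discrete prominence levels. The z-score scheme and the bin thresholds are assumptions; the patent does not fix a particular formula:

```python
# Sketch: per-speaker normalization followed by discretization of a
# prosodic feature. Thresholds (-0.5, 0.5) are illustrative choices.
from statistics import mean, pstdev

def zscore_per_speaker(values_by_speaker):
    """values_by_speaker: {speaker: [feature values]} -> z-scored values."""
    normalized = {}
    for speaker, values in values_by_speaker.items():
        mu, sigma = mean(values), pstdev(values) or 1.0
        normalized[speaker] = [(v - mu) / sigma for v in values]
    return normalized

def discretize(z, thresholds=(-0.5, 0.5)):
    """Map a z-score to a discrete prominence level 0/1/2."""
    level = 0
    for t in thresholds:
        if z > t:
            level += 1
    return level

# Two readers with different habitual speaking rates; after per-speaker
# z-scoring, their relative patterns become comparable.
durations = {"reader_a": [0.10, 0.20, 0.30], "reader_b": [0.30, 0.60, 0.90]}
norm = zscore_per_speaker(durations)
levels = {s: [discretize(z) for z in zs] for s, zs in norm.items()}
```

Note that both readers, though their absolute durations differ by a factor of three, receive the same discrete pattern once normalized.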
When applying the automatic prosodic prominence labeling algorithm to each sentence of the segmented text, some or all of the three prosodic features above can be selected as the input of the algorithm. Here, the prosodic prominence labeling of a sentence (or the prosodic labeling of a sentence) refers to the numeric sequence corresponding to the sentence, whose values reflect the degree of prosodic prominence of the different parts (or basic units) of the sentence.
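A minimal sketch of such a prominence-labeling algorithm is given below, combining two of the named features (duration and intensity), min-max normalized within the sentence, into one numeric score per unit. The equal-weight average is an assumed choice; the patent leaves the exact algorithm open:

```python
# Sketch: produce the numeric prominence sequence for one sentence from
# per-unit prosodic features, normalized within the sentence.

def minmax(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

def prominence_sequence(durations, intensities):
    d, i = minmax(durations), minmax(intensities)
    # Equal-weight average of the normalized features (an assumption).
    return [round((a + b) / 2, 3) for a, b in zip(d, i)]

# One sentence of four units with toy per-unit features.
scores = prominence_sequence(
    durations=[0.10, 0.30, 0.20, 0.10],
    intensities=[55.0, 70.0, 60.0, 55.0],
)
```

The second unit, longest and loudest, receives the highest prominence value, matching the intuition that stressed words stand out on both features.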
According to the second aspect of the invention, an application method of automatic prosody extraction in natural language processing tasks is also provided. The application method includes: treating the prosody of text data as a sequence labeling task and using a long short-term memory network (LSTM) to model the prosody time series; the input of the LSTM model is the word vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosodic prominence label of the basic unit at the current position.
More specifically, the LSTM model can be extended to bidirectional LSTM networks, multi-layer bidirectional LSTM networks, or recurrent neural networks over time and their derived types and structures, such as gated recurrent networks (Gated Recurrent Network, GRN).
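To make the described data flow concrete - a word vector sequence goes in, and one prominence label comes out per time step - here is a deliberately tiny, untrained LSTM cell in plain Python. The scalar weights and the bucketing "output layer" are toy assumptions; a real implementation would use a trained, vector-valued LSTM layer from a deep-learning framework:

```python
# Sketch of LSTM-based sequence labeling: step through the input sequence
# and emit a prominence label at each position. Weights are fixed toy
# values, so this shows the mechanics, not learned behavior.
import math

def lstm_step(x, h, c, w=0.5, u=0.3):
    """Single-unit LSTM cell with shared scalar weights for brevity."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    i = sigmoid(w * x + u * h)          # input gate
    f = sigmoid(w * x + u * h + 1.0)    # forget gate (bias 1.0)
    o = sigmoid(w * x + u * h)          # output gate
    g = math.tanh(w * x + u * h)        # candidate cell state
    c = f * c + i * g                   # new cell state
    h = o * math.tanh(c)                # new hidden state
    return h, c

def tag_sequence(word_vec_seq, n_labels=3):
    """Emit one prominence label per time step from the hidden state."""
    h = c = 0.0
    labels = []
    for x in word_vec_seq:
        h, c = lstm_step(x, h, c)
        # Toy "output layer": bucket h (which lies in (-1, 1)) into bins.
        level = min(n_labels - 1, int((h + 1.0) / 2.0 * n_labels))
        labels.append(level)
    return labels

labels = tag_sequence([0.1, 2.0, -2.0, 0.5])
```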
More specifically, the application method also includes: using the text prosody data set for a sentence compression task based on a recurrent neural network (RNN). Text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, with alternating training under multi-task learning: in each period, part of the text prosody data or the sentence compression data is input to the model, the other task is input in the next period, and the two tasks alternate until the model converges. Fig. 2 shows the processing mode of the multi-task LSTM model of the invention: text prosodic prominence labeling, as the auxiliary task, corresponds to the output of the A-series nodes, and the sentence compression task, as the main task, corresponds to the output of the Y-series nodes. In the alternating training mode, in each period a part of the text prosodic prominence labeling task data or the sentence compression data is input to the model, the other task is input in the next period, and the two tasks alternate until the model converges. Fig. 3 shows the processing mode of the multi-task bidirectional LSTM model of the invention.
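The alternating schedule just described can be sketched as follows. Here `train_step`, the crude convergence test, and the fake decaying loss are stand-ins for a real RNN update, introduced only to make the alternation runnable:

```python
# Sketch of alternating multi-task training: each period feeds the model
# a batch from one task, the next period the other, until convergence.
import itertools

def alternate_train(prosody_batches, compression_batches,
                    train_step, tol=1e-3, max_periods=100):
    tasks = itertools.cycle([("prosody", prosody_batches),
                             ("compression", compression_batches)])
    prev_loss, schedule = float("inf"), []
    for _ in range(max_periods):
        name, batches = next(tasks)
        loss = train_step(name, next(batches))
        schedule.append(name)
        if abs(prev_loss - loss) < tol:   # crude convergence test
            break
        prev_loss = loss
    return schedule

# Dummy task data and a fake "loss" that decays toward zero.
losses = iter(1.0 / (n * n) for n in itertools.count(1))
fake_step = lambda name, batch: next(losses)
schedule = alternate_train(itertools.cycle([["p1"]]),
                           itertools.cycle([["c1"]]),
                           fake_step)
```

In a real setting both tasks would share the recurrent layers, with separate output heads for the A-series and Y-series nodes of Figs. 2 and 3.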
More specifically, the application method also includes: using the text prosody data set for natural language processing tasks based on recurrent neural networks. Text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, with alternating training under multi-task learning: in each period, part of the text prosody data or the sentence compression data is input to the model, the other task is input in the next period, and the two tasks alternate, optimizing the model parameters until the model converges. Here, recurrent neural networks include but are not limited to LSTM, GRU, and their extensions in depth.
The above can be described formally. Let X be the input text sequence, A the prosodic prominence sequence corresponding to the text sequence, and Y the compression labels corresponding to the text. The three sequences have the following form:

X = (x_1, ..., x_N),
A = (a_1, ..., a_N),
Y = (y_1, ..., y_N).

The above task actually optimizes the following problem (the original formulas are reconstructed here from the surrounding description): theta* = argmax_theta sum_{t=1}^{N} log p(a_t; theta).

For the unidirectional LSTM model, p can be expressed as p(a_t | x_1, ..., x_t; theta); for the bidirectional LSTM model, p can be expressed as p(a_t | x_1, ..., x_N; theta), where theta denotes the model parameters.

Using the optimized parameters theta*, the prosodic prominence prediction output of the model is expressed as A^ = argmax_A p(A | X; theta*).

In the same way, an expression of the same form can be obtained for the main prediction task Y of the model, and is not repeated here.
Fig. 4 shows the system block diagram of an automatic speech prosody extraction and labeling system according to the invention.

As shown in Fig. 4, the system includes:
an acquisition module, which receives the speech data to be labeled and obtains the text corresponding to the speech data;
an alignment module, which aligns the collected speech data and its text on the time axis using text-to-speech alignment technology, forming an aligned text;
a segmentation module, which segments the aligned text into sentences, generating samples in units of sentences;
an automatic prosody labeling module, which applies an automatic prosodic prominence labeling algorithm to each sentence in the samples, so as to construct and obtain an automatically labeled text prosody data set, wherein the prosodic prominence labeling of a sentence (or the prosodic labeling of a sentence) refers to the numeric sequence corresponding to the sentence, whose values reflect the degree of prosodic prominence of the different parts (or basic units) of the sentence.
More specifically, aligning the speech data and its corresponding text on the time axis in the alignment module means making each basic unit of the text correspond to a time interval of the speech data, so as to obtain the speech data fragment corresponding to each basic unit of the text, where a basic unit refers to a character or word of Chinese, or a word of English.
More specifically, the segmentation module is also used to: if the original speech data contains multiple readers or multiple different reading environments, normalize the pronunciation habits of the different readers separately, and discretize the prosodic features of the speech data as needed.
Through automatic text-to-speech alignment, the present invention aligns each speech fragment with its corresponding word in the text and uses the speech fragment as an index of the word's prosodic prominence, thereby automatically generating a large amount of labeled text prosody data and constructing a text prosody data set.
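As an illustrative sketch of turning aligned fragments into a label sequence, the code below uses each word's fragment duration as its prominence index and normalizes by the sentence total so that sequences are comparable across sentences; the normalization choice and the alignment values are assumptions for illustration:

```python
# Hypothetical aligner output: (word, start_sec, end_sec)
alignment = [("the", 0.00, 0.08), ("quick", 0.08, 0.38),
             ("fox", 0.38, 0.80)]

def prominence_sequence(alignment):
    """One numeric prominence value per basic unit: the unit's share
    of the total sentence duration."""
    durs = [end - start for _, start, end in alignment]
    total = sum(durs)
    return [round(d / total, 3) for d in durs]

words = [w for w, _, _ in alignment]
scores = prominence_sequence(alignment)
print(list(zip(words, scores)))
```

The resulting sequence is exactly the "numerical sequence corresponding to the sentence" described above: one value per unit, with larger values marking more prominent units.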
Meanwhile, the present invention exploits the weakly supervised nature of this data set: under a recurrent neural network model structure, the text prosody data set is trained in alternation with other natural language processing tasks in a multi-task learning fashion, so as to improve the performance of those other tasks.
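The alternating schedule itself can be sketched independently of any particular model; below, `train_step` is a stand-in for one parameter update on the shared recurrent model, and the batch contents are placeholders:

```python
from itertools import cycle, islice

# Batches from the auxiliary (prosody labeling) task and the main task are
# interleaved, one task per time period, until convergence. Names are
# illustrative placeholders.
prosody_batches = [f"prosody_batch_{i}" for i in range(3)]
main_batches = [f"main_batch_{i}" for i in range(3)]

def alternating_schedule(aux, main, periods):
    """Yield (task_name, batch), strictly alternating between the two tasks."""
    tasks = cycle([("prosody", cycle(aux)), ("main", cycle(main))])
    for task_name, source in islice(tasks, periods):
        yield task_name, next(source)

log = []
def train_step(task, batch):  # stand-in for one optimizer step
    log.append((task, batch))

for task, batch in alternating_schedule(prosody_batches, main_batches, 6):
    train_step(task, batch)
print(log)
```

In a real setup the loop would run until a convergence criterion on the main task is met, rather than for a fixed number of periods.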
In the description of this specification, reference to terms such as "one embodiment" or "a specific embodiment" means that a particular feature, structure, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the invention; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. An automatic speech prosody extraction and labeling method, characterized in that the method comprises the following steps:
Step 1: receiving the speech data to be labeled and obtaining the text corresponding to the speech data;
Step 2: aligning the collected speech data with the corresponding text on the time axis using text-to-speech alignment techniques to form an aligned text;
Step 3: segmenting the aligned text into sentences, so as to generate samples in units of sentences;
Step 4: applying an automatic prosodic prominence scaling algorithm to each sentence in the samples, so as to construct a text prosody data set with automatic labels.
2. The automatic speech prosody extraction and labeling method according to claim 1, characterized in that, in step 2, aligning the speech data with its corresponding text on the time axis means that each basic unit in the text is made to correspond to a time span of the speech data, so as to obtain the speech data fragment corresponding to each basic unit in the text, where a basic unit refers to a character or word of Chinese, or a word of English.
3. The automatic speech prosody extraction and labeling method according to claim 1, characterized in that step 4 further comprises: if the raw speech data contains multiple speakers or multiple different reading environments, normalizing the pronunciation habits of the different speakers separately, and applying discretization to the prosodic features of the speech data as needed.
4. An application of the automatic prosody extraction method according to any one of claims 1-3 in natural language processing tasks, characterized in that the method comprises:
Treating the prosody of text data as a sequence labeling task and using a long short-term memory (LSTM) neural network to model the prosodic time series, where the input to the LSTM model is the word-vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosodic prominence label of the basic unit at the current position.
5. The application of the automatic prosody extraction method in natural language processing tasks according to claim 4, characterized in that the LSTM model can be extended to bidirectional LSTM networks, multi-layer bidirectional LSTM networks, or temporal recurrent neural networks and their derived types and structures.
6. The application of the automatic prosody extraction method in natural language processing tasks according to claim 5, characterized in that the method further comprises:
Using the text prosody data set for a sentence compression task based on a recurrent neural network (RNN): taking text prosodic prominence labeling as the auxiliary task and sentence compression as the main task, and using the alternating training method of multi-task learning, where in each time period a portion of the text prosody data or of the sentence compression data is fed to the model, in the next time period the other task's data is fed, and the two tasks alternate until the model converges.
7. The application of the automatic prosody extraction method in natural language processing tasks according to claim 5, characterized in that the method further comprises:
Using the text prosody data set for natural language processing tasks based on recurrent neural networks and their related extended or improved structures: taking text prosodic prominence labeling as the auxiliary task and the sentence compression task as the main task, and using the alternating training method of multi-task learning, where in each time period a portion of the text prosody data or of the sentence compression data is fed to the model, in the next time period the other task's data is fed, and the two tasks alternate, optimizing the model parameters until the model converges.
8. An automatic speech prosody extraction and labeling system, characterized in that the system includes:
An acquisition module, which receives the speech data to be labeled and obtains the text corresponding to the speech data;
An alignment module, which aligns the collected speech data with its text on the time axis using text-to-speech alignment techniques to form an aligned text;
A segmentation module, which segments the aligned text into sentences to generate samples in units of sentences;
An automatic prosody labeling module, which applies an automatic prosodic prominence scaling algorithm to each sentence in the samples, thereby constructing a text prosody data set with automatic labels.
9. The automatic speech prosody extraction and labeling system according to claim 8, characterized in that, in the alignment module, aligning the speech data with its corresponding text on the time axis means that each basic unit in the text is made to correspond to a time span of the speech data, so as to obtain the speech data fragment corresponding to each basic unit in the text, where a basic unit refers to a character or word of Chinese, or a word of English.
10. The automatic speech prosody extraction and labeling system according to claim 8, characterized in that the segmentation module is further configured to:
If the raw speech data contains multiple speakers or multiple different reading environments, normalize the pronunciation habits of the different speakers separately, and apply discretization to the prosodic features of the speech data as needed.
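The per-time-step prediction formulated in claim 4 can be illustrated with a minimal single-cell LSTM forward pass; the dimensions, random weights, and word vectors below are all hypothetical stand-ins, and a trained model would replace them:

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, U, b):
    # one LSTM cell step: gates computed from input x and previous hidden h
    z = W @ x + U @ h + b
    H = h.size
    i = 1 / (1 + np.exp(-z[:H]))        # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))     # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))   # output gate
    g = np.tanh(z[3*H:])                # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

D, H, L = 8, 16, 5  # word-vector dim, hidden size, number of prominence levels
W = rng.normal(0, 0.1, (4*H, D))
U = rng.normal(0, 0.1, (4*H, H))
b = np.zeros(4*H)
V = rng.normal(0, 0.1, (L, H))  # per-time-step output projection

sentence = rng.normal(size=(6, D))  # 6 word vectors (hypothetical embeddings)
h, c = np.zeros(H), np.zeros(H)
labels = []
for x in sentence:
    h, c = lstm_step(x, h, c, W, U, b)
    labels.append(int(np.argmax(V @ h)))  # prominence level for this position
print(labels)  # one predicted label per word
```

The bidirectional and multi-layer variants of claim 5 would add a second pass over the reversed sentence and stacked cells, respectively, but the per-position output structure stays the same.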
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710023633.8A CN106683667A (en) | 2017-01-13 | 2017-01-13 | Automatic rhythm extracting method, system and application thereof in natural language processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106683667A true CN106683667A (en) | 2017-05-17 |
Family
ID=58858838
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710023633.8A Pending CN106683667A (en) | 2017-01-13 | 2017-01-13 | Automatic rhythm extracting method, system and application thereof in natural language processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106683667A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112136141A (en) * | 2018-03-23 | 2020-12-25 | 谷歌有限责任公司 | Robot based on free form natural language input control |
US11972339B2 (en) | 2018-03-23 | 2024-04-30 | Google Llc | Controlling a robot based on free-form natural language input |
CN111989696A (en) * | 2018-04-18 | 2020-11-24 | 渊慧科技有限公司 | Neural network for scalable continuous learning in domains with sequential learning tasks |
US12020164B2 (en) | 2018-04-18 | 2024-06-25 | Deepmind Technologies Limited | Neural networks for scalable continual learning in domains with sequentially learned tasks |
CN108986798A (en) * | 2018-06-27 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Processing method, device and the equipment of voice data |
WO2020024582A1 (en) * | 2018-07-28 | 2020-02-06 | 华为技术有限公司 | Speech synthesis method and related device |
CN112307236A (en) * | 2019-07-24 | 2021-02-02 | 阿里巴巴集团控股有限公司 | Data labeling method and device |
CN111105785A (en) * | 2019-12-17 | 2020-05-05 | 广州多益网络股份有限公司 | Text prosodic boundary identification method and device |
CN111507104A (en) * | 2020-03-19 | 2020-08-07 | 北京百度网讯科技有限公司 | Method and device for establishing label labeling model, electronic equipment and readable storage medium |
US11531813B2 (en) | 2020-03-19 | 2022-12-20 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method, electronic device and readable storage medium for creating a label marking model |
CN112183086A (en) * | 2020-09-23 | 2021-01-05 | 北京先声智能科技有限公司 | English pronunciation continuous reading mark model based on sense group labeling |
CN117012178A (en) * | 2023-07-31 | 2023-11-07 | 支付宝(杭州)信息技术有限公司 | Prosody annotation data generation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106683667A (en) | Automatic rhythm extracting method, system and application thereof in natural language processing | |
Liu et al. | Learning natural language inference using bidirectional LSTM model and inner-attention | |
CN108597541B (en) | Speech emotion recognition method and system for enhancing anger and happiness recognition | |
WO2023024412A1 (en) | Visual question answering method and apparatus based on deep learning model, and medium and device | |
CN108255805A (en) | The analysis of public opinion method and device, storage medium, electronic equipment | |
CN103761975B (en) | Method and device for oral evaluation | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN109979429A (en) | A kind of method and system of TTS | |
CN107220235A (en) | Speech recognition error correction method, device and storage medium based on artificial intelligence | |
CN110489750A (en) | Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF | |
CN107731228A (en) | The text conversion method and device of English voice messaging | |
CN108549658A (en) | A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree | |
CN107679225A (en) | A kind of reply generation method based on keyword | |
CN112579794B (en) | Method and system for predicting semantic tree for Chinese and English word pairs | |
CN104538025A (en) | Method and device for converting gestures to Chinese and Tibetan bilingual voices | |
CN110717341A (en) | Method and device for constructing old-Chinese bilingual corpus with Thai as pivot | |
CN110852089A (en) | Operation and maintenance project management method based on intelligent word segmentation and deep learning | |
Kasai et al. | End-to-end graph-based TAG parsing with neural networks | |
CN114360584A (en) | Phoneme-level-based speech emotion layered recognition method and system | |
CN116029303A (en) | Language expression mode identification method, device, electronic equipment and storage medium | |
CN116483314A (en) | Automatic intelligent activity diagram generation method | |
CN115221284A (en) | Text similarity calculation method and device, electronic equipment and storage medium | |
CN109002540A (en) | It is a kind of Chinese notice document problem answers to automatic generation method | |
CN114943235A (en) | Named entity recognition method based on multi-class language model | |
Sirirattanajakarin et al. | BoydCut: Bidirectional LSTM-CNN Model for Thai Sentence Segmenter |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170517 |