CN106683667A - Automatic rhythm extracting method, system and application thereof in natural language processing - Google Patents
- Publication number: CN106683667A (application CN201710023633.8A)
- Authority: CN (China)
- Prior art keywords: rhythm, text, data, sentence, automatic
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/04 — Segmentation; word boundary detection
- G10L15/063 — Training of speech recognition systems
- G10L15/142 — Hidden Markov Models [HMMs]
- G10L15/148 — Duration modelling in HMMs, e.g. semi-HMM, segmental models or transition probabilities
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/1807 — Speech classification or search using natural language modelling using prosody or stress
- G10L15/1815 — Semantic context, e.g. disambiguation of recognition hypotheses based on word meaning
- G10L25/30 — Speech or voice analysis techniques characterised by the use of neural networks
Abstract
The invention relates to an automatic prosody (rhythm) extraction method and system and their application in natural language processing. The method applies automatic text-to-speech alignment technology to generate a large-scale prosody data set, models the prosody of sentences with a recurrent neural network to which a bidirectional extension mechanism is added, and applies the automatically constructed text prosody data to natural language processing tasks based on recurrent neural networks. The method makes full use of the isomorphism between text prosody sequences and the sequence data common in natural language processing tasks: through alternating training under multi-task learning, natural language processing tasks are improved without the assistance of manually and explicitly annotated semantic information. In practice the method overcomes the low efficiency, inconsistent standards, and unsuitability for large-scale application of manual prosody annotation, and at the same time transfers the semantics and pragmatics present in massive speech data to other tasks.
Description
Technical field
The present invention relates to a speech prosody extraction method, and more particularly to an automatic prosody extraction method and system and their application in natural language processing tasks.
Background technology
Prosody in speech can reflect the speaker's intent by giving different words in a sentence different degrees of prominence; prosodic prominence is therefore considered indicative for understanding the semantics and pragmatics of speech. The prosody of speech mainly includes information such as liaison, sense-group pauses, stress, and rising and falling tones. Besides speech, text is another form in which semantics and pragmatics can be expressed, and the prosodic features it contains can be understood and learned by different readers; that is, text itself contains prosodic features, and this characteristic can be learned and predicted. The prosody contained in text can in turn provide other natural language processing tasks with semantic and pragmatic guidance and thereby improve their performance. The prosody implicit in text data cannot be observed directly, so it can only be obtained from speech data, labeling the prosody of the corresponding text; only then can an algorithm learn to perceive and predict prosody from plain text and so provide other natural language processing tasks with supervised guidance beyond syntactic information.

Most current natural language processing frameworks take the word and its representation (word vector) as the basic unit, whereas prosodic features in speech appear as continuous feature sequences in which speech has no obvious word segmentation points. In addition, large-scale, high-quality corpora of accurate word-level prosody cannot be obtained from speech recognition technology for training. As a result, most current methods for extracting and using speech prosodic features require people with expert knowledge to manually segment speech fragments, align speech with text, and annotate word prosodic features, making the generation of supervised data inefficient.
The prior art includes the following related literature:
1) Brenier, J. M.; Cer, D. M.; and Jurafsky, D. 2005. The detection of emphatic words using acoustic and lexical features. In INTERSPEECH, 3297-3300.
2) Brenier, J. M. 2008. The Automatic Prediction of Prosodic Prominence from Text. ProQuest.
These documents provide methods for predicting prosody from plain text, together with corresponding evaluation metrics. The ToBI toolset was used to manually segment speech and annotate it and its corresponding text with prosodic prominence, and a text prosody data set was generated by judging whether each word is prominent according to its acoustic features, such as pronunciation duration, intensity of phonation, and the minima and maxima of the fundamental frequency. A maximum entropy classifier was also used to learn and predict the prosody of the text; using text features alone, the classifier reaches a prediction accuracy of about 79%. These documents did not apply the generated prosody data set to assist other natural language processing tasks.
A further related document:
3) Hovy, D.; Anumanchipalli, G. K.; Parlikar, A.; Vaughn, C.; Lammert, A.; Hovy, E.; and Black, A. W. 2013. Analysis and Modeling of "Focus" in Context. In INTERSPEECH, 402-406.
This document provides a method for predicting prosody from context using plain text. Building on the related work above, it uses context to facilitate text prosody prediction, and uses crowdsourcing to carry out manual prosody data set annotation at a certain scale.
In all three of the related documents named above, word prosody attributes are, without exception, annotated manually, and the speech must be segmented and aligned with the text before annotation. This limits the efficiency of data set generation, so that these methods cannot obtain a large amount of labeled data in a short time; the methods mentioned in the documents above therefore lack practicality and cannot be applied in actual production. Meanwhile, the sample sizes of the data sets produced by the above methods are insufficient to cover the whole problem space of prosody prediction, so the algorithms do not scale well and underperform in application.

Therefore, no method has been found in the prior art that can automatically extract the prosodic features corresponding to words from speech; all of it is done by manual extraction. Nor has any record or practical application been found in the existing literature of using the prosodic features of the text corresponding to speech to assist natural language processing tasks. In this specific category, the present invention provides the first feasible method.
The content of the invention
The present invention is intended to solve at least one of the technical problems present in the prior art.

To this end, it is an object of the present invention to propose an efficient automatic prosody extraction method and its application in natural language processing tasks. This method can overcome the defects of traditional manual annotation - inefficiency, inconsistent standards, and unsuitability for large-scale application - while transferring the semantic and pragmatic characteristics present in large amounts of speech data to other tasks. As a data generation method that is unsupervised with respect to annotation, the present invention can effectively exploit the prosodic patterns in speech to improve the performance of other natural language processing tasks.
To achieve the above object, the invention provides an automatic speech prosody extraction and labeling method comprising the following steps:

Step 1: receive the speech data to be labeled and obtain the text corresponding to the speech data;
Step 2: align the collected speech data and the corresponding text on the time axis using text-to-speech alignment technology, forming an aligned text;
Step 3: segment the aligned text into sentences, so as to generate samples in units of sentences;
Step 4: apply an automatic prosodic prominence labeling algorithm to each sentence in the samples, so as to construct and obtain an automatically labeled text prosody data set, wherein the prosodic prominence labeling of a sentence (or the prosodic labeling of a sentence) refers to the numeric sequence corresponding to the sentence, whose values reflect the degree of prosodic prominence of the different parts (or basic units) of the sentence.
More specifically, aligning the speech data and its corresponding text on the time axis in Step 2 means making each basic unit of the text correspond to a time interval of the speech data, so as to obtain the speech data fragment corresponding to each basic unit of the text, where a basic unit refers to a character or word of Chinese, or a word of English.
More specifically, Step 4 also includes: if the original speech data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers need to be normalized separately, and the prosodic features of the speech data discretized.
According to a further aspect of the invention, an application of the automatic prosody extraction method in natural language processing tasks is also provided. The method includes: treating the prosody of text data as a sequence labeling task and using a long short-term memory network (LSTM) to model the prosody time series; the input of the LSTM model is the word vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosodic prominence label of the basic unit at the current position.
More specifically, the LSTM model can be extended to bidirectional LSTM networks, multi-layer bidirectional LSTM networks, or recurrent neural networks over time and their derived types and structures.
More specifically, the method also includes: using the text prosody data set for a sentence compression task based on a recurrent neural network (RNN). Text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, with alternating training under multi-task learning: in each period, part of the text prosody data or the sentence compression data is input to the model, the other task is input in the next period, and the two tasks alternate until the model converges.
More specifically, the method also includes: using the text prosody data set to assist natural language processing tasks based on recurrent neural networks and their related extended and improved structures. Text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, with alternating training under multi-task learning: in each period, part of the text prosody data or the sentence compression data is input to the model, the other task is input in the next period, and the two tasks alternate, optimizing the model parameters until the model converges.
According to a further aspect of the invention, an automatic speech prosody extraction and labeling system is also provided. The system includes:

an acquisition module, which receives the speech data to be labeled and obtains the text corresponding to the speech data;
an alignment module, which aligns the collected speech data and its text on the time axis using text-to-speech alignment technology, forming an aligned text;
a segmentation module, which segments the aligned text into sentences, generating samples in units of sentences;
an automatic prosody labeling module, which applies an automatic prosodic prominence labeling algorithm to each sentence in the samples, so as to construct and obtain an automatically labeled text prosody data set, wherein the prosodic prominence labeling of a sentence (or the prosodic labeling of a sentence) refers to the numeric sequence corresponding to the sentence, whose values reflect the degree of prosodic prominence of the different parts (or basic units) of the sentence.
More specifically, aligning the speech data and its corresponding text on the time axis in the alignment module means making each basic unit of the text correspond to a time interval of the speech data, so as to obtain the speech data fragment corresponding to each basic unit of the text, where a basic unit refers to a character or word of Chinese, or a word of English.
More specifically, the segmentation module is also used to: if the original speech data contains multiple readers or multiple different reading environments, normalize the pronunciation habits of the different readers separately and, as needed, discretize the prosodic features.
The present invention has the following beneficial technical effects:

1) Automatic text-to-speech alignment technology is used to generate a large-scale prosody data set. The aligned speech fragments serve as prosody indices; with the quality of the prosodic prominence annotation controlled to a certain strength, a text prosody data set with weakly supervised characteristics can be constructed. Compared with traditional manual annotation, besides being more efficient, this is also significantly more extensible: prior knowledge can be added at any time to adjust the actual annotation results and performance of the data set. Processing is fast and low-cost, and an enormous amount of data is constructed while saving substantial human resources (the amount of data generated in the same time exceeds traditional methods by more than two orders of magnitude).

2) The present invention uses a recurrent neural network to model the prosody of a sentence. With a bidirectional extension mechanism added, the recurrent neural network can effectively take the context of a word into account, and the prediction accuracy for word prosodic prominence labels can exceed 90%, significantly better than the traditional maximum entropy method. At the same time, no expert knowledge is needed for feature extraction; feature engineering is reduced, and the process conforms better to human cognition.

3) The present invention applies the automatically constructed text prosody data set to natural language processing tasks based on recurrent neural networks. The method makes full use of the isomorphism between text prosody sequences and the sequence data common in natural language processing tasks; through alternating training under multi-task learning, natural language processing tasks are improved without the assistance of explicitly annotated semantic information. In the example of the sentence compression task, the method of the present invention achieves a significant performance improvement over the prior art (more than 10%).

Additional aspects and advantages of the present invention will be given in the following description; some will become apparent from the description, or be learned through practice of the invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of the embodiments with reference to the accompanying drawings, in which:

Fig. 1 shows the flow chart of an automatic speech prosody extraction and labeling method according to the invention;
Fig. 2 shows the processing mode of the multi-task LSTM model of the invention;
Fig. 3 shows the processing mode of the multi-task bidirectional LSTM model of the invention;
Fig. 4 shows the system block diagram of an automatic speech prosody extraction and labeling system according to the invention.
Specific embodiments
In order that the above objects, features, and advantages of the present invention can be understood more clearly, the present invention is described in further detail below in conjunction with the accompanying drawings and specific embodiments. It should be noted that, where there is no conflict, the features of the embodiments and examples of this application may be combined with one another.

Many details are set forth in the following description in order to fully understand the present invention; however, the present invention may also be implemented in ways different from those described here, and the protection scope of the present invention is therefore not limited by the specific embodiments disclosed below.
Fig. 1 shows the flow chart of an automatic speech prosody extraction and labeling method according to the invention.

As shown in Fig. 1, the method includes the following steps:
Step 1: receive the speech data to be labeled and obtain the text corresponding to the speech data.
Step 2: align the collected speech data and the corresponding text on the time axis using text-to-speech alignment technology, forming an aligned text.
Specifically, each basic unit of the text can be made to correspond to a time interval of the speech data, so as to obtain the speech data fragment corresponding to each basic unit of the text. A basic unit refers to a character or word of Chinese, or a word of English.

In addition, text-to-speech alignment technology includes, but is not limited to, obtaining the time at which the pronunciation of each basic unit in the speech data begins and the time at which it ends, so as to obtain the time interval occupied by each basic unit in the speech data and the time period between basic units.
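As a sketch of what this alignment step yields, the following illustrative Python maps each basic unit to its slice of the waveform. The tuple format `(unit, start_sec, end_sec)`, the sample rate, and the toy waveform are assumptions for illustration, not an interface fixed by the patent:

```python
# Sketch: turn per-unit time intervals from a text-to-speech (forced)
# aligner into speech data fragments, one per basic unit.

def fragments_from_alignment(samples, sample_rate, alignment):
    """Map each basic unit (character/word) to its slice of the waveform."""
    fragments = {}
    for unit, start_sec, end_sec in alignment:
        lo = round(start_sec * sample_rate)
        hi = round(end_sec * sample_rate)
        fragments[unit] = samples[lo:hi]
    return fragments

# Toy "waveform": 1 second of audio at 100 Hz, for brevity.
samples = list(range(100))
alignment = [("the", 0.00, 0.20), ("cat", 0.25, 0.60), ("sat", 0.65, 0.95)]
frags = fragments_from_alignment(samples, 100, alignment)
```

The gaps between intervals (0.20-0.25 s, 0.60-0.65 s) correspond to the inter-unit time periods the text mentions.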
Step 3: segment the aligned text into sentences, generating samples in units of sentences.
For example, sentence segmentation of the aligned text can be, but is not limited to being, performed according to the punctuation of the sentences, so that each sentence is composed of basic units each accompanied by its corresponding speech data fragment.
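The punctuation-based segmentation just described can be sketched as follows; the token/interval representation and the punctuation set are illustrative assumptions:

```python
# Sketch: split an aligned token stream into sentence samples at
# punctuation marks, keeping each token paired with its time interval.

SENTENCE_END = {".", "!", "?", "。", "！", "？"}

def split_sentences(aligned_tokens):
    """aligned_tokens: list of (token, (start, end)) pairs; punctuation
    tokens carry a zero-length interval here for simplicity."""
    sentences, current = [], []
    for token, interval in aligned_tokens:
        if token in SENTENCE_END:
            if current:
                sentences.append(current)
                current = []
        else:
            current.append((token, interval))
    if current:
        sentences.append(current)
    return sentences

stream = [("hello", (0.0, 0.4)), ("world", (0.5, 0.9)), (".", (0.9, 0.9)),
          ("again", (1.0, 1.4)), ("?", (1.4, 1.4))]
sents = split_sentences(stream)
```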
Step 4: apply an automatic prosodic prominence labeling algorithm to each sentence after segmentation, so as to construct and obtain an automatically labeled text prosody data set.
Specifically, this step also includes: if the original speech data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers need to be normalized separately to eliminate their influence, and the prosodic features of the speech data discretized as needed. Here, prosodic features refer to the pronunciation duration of a basic unit, the intensity of phonation, and the maxima and minima of the fundamental frequency.
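One plausible reading of this normalization and discretization step is sketched below: each prosodic feature (here, per-unit duration) is z-scored within a speaker to cancel personal reading habits, then binned into discrete prominence levels. The z-score scheme and the bin thresholds are assumptions; the patent does not fix a particular formula:

```python
# Sketch: per-speaker normalization followed by discretization of a
# prosodic feature. Thresholds (-0.5, 0.5) are illustrative choices.
from statistics import mean, pstdev

def zscore_per_speaker(values_by_speaker):
    """values_by_speaker: {speaker: [feature values]} -> z-scored values."""
    normalized = {}
    for speaker, values in values_by_speaker.items():
        mu, sigma = mean(values), pstdev(values) or 1.0
        normalized[speaker] = [(v - mu) / sigma for v in values]
    return normalized

def discretize(z, thresholds=(-0.5, 0.5)):
    """Map a z-score to a discrete prominence level 0/1/2."""
    level = 0
    for t in thresholds:
        if z > t:
            level += 1
    return level

# Two readers with different habitual speaking rates; after per-speaker
# z-scoring, their relative patterns become comparable.
durations = {"reader_a": [0.10, 0.20, 0.30], "reader_b": [0.30, 0.60, 0.90]}
norm = zscore_per_speaker(durations)
levels = {s: [discretize(z) for z in zs] for s, zs in norm.items()}
```

Note that both readers, though their absolute durations differ by a factor of three, receive the same discrete pattern once normalized.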
When applying the automatic prosodic prominence labeling algorithm to each sentence of the segmented text, some or all of the three prosodic features above can be selected as the input of the algorithm. Here, the prosodic prominence labeling of a sentence (or the prosodic labeling of a sentence) refers to the numeric sequence corresponding to the sentence, whose values reflect the degree of prosodic prominence of the different parts (or basic units) of the sentence.
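A minimal sketch of such a prominence-labeling algorithm is given below, combining two of the named features (duration and intensity), min-max normalized within the sentence, into one numeric score per unit. The equal-weight average is an assumed choice; the patent leaves the exact algorithm open:

```python
# Sketch: produce the numeric prominence sequence for one sentence from
# per-unit prosodic features, normalized within the sentence.

def minmax(values):
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]

def prominence_sequence(durations, intensities):
    d, i = minmax(durations), minmax(intensities)
    # Equal-weight average of the normalized features (an assumption).
    return [round((a + b) / 2, 3) for a, b in zip(d, i)]

# One sentence of four units with toy per-unit features.
scores = prominence_sequence(
    durations=[0.10, 0.30, 0.20, 0.10],
    intensities=[55.0, 70.0, 60.0, 55.0],
)
```

The second unit, longest and loudest, receives the highest prominence value, matching the intuition that stressed words stand out on both features.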
According to the second aspect of the invention, an application method of automatic prosody extraction in natural language processing tasks is also provided. The application method includes: treating the prosody of text data as a sequence labeling task and using a long short-term memory network (LSTM) to model the prosody time series; the input of the LSTM model is the word vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosodic prominence label of the basic unit at the current position.
More specifically, the LSTM model can be extended to bidirectional LSTM networks, multi-layer bidirectional LSTM networks, or recurrent neural networks over time and their derived types and structures, such as gated recurrent networks (Gated Recurrent Network, GRN).
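To make the described data flow concrete - a word vector sequence goes in, and one prominence label comes out per time step - here is a deliberately tiny, untrained LSTM cell in plain Python. The scalar weights and the bucketing "output layer" are toy assumptions; a real implementation would use a trained, vector-valued LSTM layer from a deep-learning framework:

```python
# Sketch of LSTM-based sequence labeling: step through the input sequence
# and emit a prominence label at each position. Weights are fixed toy
# values, so this shows the mechanics, not learned behavior.
import math

def lstm_step(x, h, c, w=0.5, u=0.3):
    """Single-unit LSTM cell with shared scalar weights for brevity."""
    sigmoid = lambda z: 1.0 / (1.0 + math.exp(-z))
    i = sigmoid(w * x + u * h)          # input gate
    f = sigmoid(w * x + u * h + 1.0)    # forget gate (bias 1.0)
    o = sigmoid(w * x + u * h)          # output gate
    g = math.tanh(w * x + u * h)        # candidate cell state
    c = f * c + i * g                   # new cell state
    h = o * math.tanh(c)                # new hidden state
    return h, c

def tag_sequence(word_vec_seq, n_labels=3):
    """Emit one prominence label per time step from the hidden state."""
    h = c = 0.0
    labels = []
    for x in word_vec_seq:
        h, c = lstm_step(x, h, c)
        # Toy "output layer": bucket h (which lies in (-1, 1)) into bins.
        level = min(n_labels - 1, int((h + 1.0) / 2.0 * n_labels))
        labels.append(level)
    return labels

labels = tag_sequence([0.1, 2.0, -2.0, 0.5])
```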
More specifically, the application method also includes: using the text prosody data set for a sentence compression task based on a recurrent neural network (RNN). Text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, with alternating training under multi-task learning: in each period, part of the text prosody data or the sentence compression data is input to the model, the other task is input in the next period, and the two tasks alternate until the model converges. Fig. 2 shows the processing mode of the multi-task LSTM model of the invention: text prosodic prominence labeling, as the auxiliary task, corresponds to the output of the A-series nodes, and the sentence compression task, as the main task, corresponds to the output of the Y-series nodes. In the alternating training mode, in each period a part of the text prosodic prominence labeling task data or the sentence compression data is input to the model, the other task is input in the next period, and the two tasks alternate until the model converges. Fig. 3 shows the processing mode of the multi-task bidirectional LSTM model of the invention.
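The alternating schedule just described can be sketched as follows. Here `train_step`, the crude convergence test, and the fake decaying loss are stand-ins for a real RNN update, introduced only to make the alternation runnable:

```python
# Sketch of alternating multi-task training: each period feeds the model
# a batch from one task, the next period the other, until convergence.
import itertools

def alternate_train(prosody_batches, compression_batches,
                    train_step, tol=1e-3, max_periods=100):
    tasks = itertools.cycle([("prosody", prosody_batches),
                             ("compression", compression_batches)])
    prev_loss, schedule = float("inf"), []
    for _ in range(max_periods):
        name, batches = next(tasks)
        loss = train_step(name, next(batches))
        schedule.append(name)
        if abs(prev_loss - loss) < tol:   # crude convergence test
            break
        prev_loss = loss
    return schedule

# Dummy task data and a fake "loss" that decays toward zero.
losses = iter(1.0 / (n * n) for n in itertools.count(1))
fake_step = lambda name, batch: next(losses)
schedule = alternate_train(itertools.cycle([["p1"]]),
                           itertools.cycle([["c1"]]),
                           fake_step)
```

In a real setting both tasks would share the recurrent layers, with separate output heads for the A-series and Y-series nodes of Figs. 2 and 3.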
More specifically, the application method also includes: using the text prosody data set for natural language processing tasks based on recurrent neural networks. Text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, with alternating training under multi-task learning: in each period, part of the text prosody data or the sentence compression data is input to the model, the other task is input in the next period, and the two tasks alternate, optimizing the model parameters until the model converges. Here, recurrent neural networks include but are not limited to LSTM, GRU, and their extensions in depth.
The above can be described formally. Let X be the input text sequence, A the prosodic prominence sequence corresponding to the text sequence, and Y the compression labels corresponding to the text. The three sequences have the following form:

X = (x_1, ..., x_N),
A = (a_1, ..., a_N),
Y = (y_1, ..., y_N).

The above task actually optimizes the following problem (the original formulas are reconstructed here from the surrounding description): theta* = argmax_theta sum_{t=1}^{N} log p(a_t; theta).

For the unidirectional LSTM model, p can be expressed as p(a_t | x_1, ..., x_t; theta); for the bidirectional LSTM model, p can be expressed as p(a_t | x_1, ..., x_N; theta), where theta denotes the model parameters.

Using the optimized parameters theta*, the prosodic prominence prediction output of the model is expressed as A^ = argmax_A p(A | X; theta*).

In the same way, an expression of the same form can be obtained for the main prediction task Y of the model, and is not repeated here.
Fig. 4 shows the system block diagram of an automatic speech prosody extraction and labeling system according to the invention.

As shown in Fig. 4, the system includes:
an acquisition module, which receives the speech data to be labeled and obtains the text corresponding to the speech data;
an alignment module, which aligns the collected speech data and its text on the time axis using text-to-speech alignment technology, forming an aligned text;
a segmentation module, which segments the aligned text into sentences, generating samples in units of sentences;
an automatic prosody labeling module, which applies an automatic prosodic prominence labeling algorithm to each sentence in the samples, so as to construct and obtain an automatically labeled text prosody data set, wherein the prosodic prominence labeling of a sentence (or the prosodic labeling of a sentence) refers to the numeric sequence corresponding to the sentence, whose values reflect the degree of prosodic prominence of the different parts (or basic units) of the sentence.
More specifically, aligning the speech data and its corresponding text on the time axis in the alignment module means making each basic unit of the text correspond to a time interval of the speech data, so as to obtain the speech data fragment corresponding to each basic unit of the text, where a basic unit refers to a character or word of Chinese, or a word of English.
More specifically, the segmentation module is also used to: if the original speech data contains multiple readers or multiple different reading environments, normalize the pronunciation habits of the different readers separately, and discretize the prosodic features of the speech data as needed.
Through automatic text-to-speech alignment, the present invention aligns each speech fragment with its corresponding word in the text and uses the speech fragment as an index of the word's prosodic prominence, thereby automatically generating a large amount of labeled text prosody data and constructing a text prosody data set.
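As an illustrative sketch of turning aligned fragments into a label sequence, the code below uses each word's fragment duration as its prominence index and normalizes by the sentence total so that sequences are comparable across sentences; the normalization choice and the alignment values are assumptions for illustration:

```python
# Hypothetical aligner output: (word, start_sec, end_sec)
alignment = [("the", 0.00, 0.08), ("quick", 0.08, 0.38),
             ("fox", 0.38, 0.80)]

def prominence_sequence(alignment):
    """One numeric prominence value per basic unit: the unit's share
    of the total sentence duration."""
    durs = [end - start for _, start, end in alignment]
    total = sum(durs)
    return [round(d / total, 3) for d in durs]

words = [w for w, _, _ in alignment]
scores = prominence_sequence(alignment)
print(list(zip(words, scores)))
```

The resulting sequence is exactly the "numerical sequence corresponding to the sentence" described above: one value per unit, with larger values marking more prominent units.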
Meanwhile, the present invention exploits the weakly supervised nature of this data set: under a recurrent neural network model structure, the text prosody data set is trained in alternation with other natural language processing tasks in a multi-task learning fashion, so as to improve the performance of those other tasks.
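The alternating schedule itself can be sketched independently of any particular model; below, `train_step` is a stand-in for one parameter update on the shared recurrent model, and the batch contents are placeholders:

```python
from itertools import cycle, islice

# Batches from the auxiliary (prosody labeling) task and the main task are
# interleaved, one task per time period, until convergence. Names are
# illustrative placeholders.
prosody_batches = [f"prosody_batch_{i}" for i in range(3)]
main_batches = [f"main_batch_{i}" for i in range(3)]

def alternating_schedule(aux, main, periods):
    """Yield (task_name, batch), strictly alternating between the two tasks."""
    tasks = cycle([("prosody", cycle(aux)), ("main", cycle(main))])
    for task_name, source in islice(tasks, periods):
        yield task_name, next(source)

log = []
def train_step(task, batch):  # stand-in for one optimizer step
    log.append((task, batch))

for task, batch in alternating_schedule(prosody_batches, main_batches, 6):
    train_step(task, batch)
print(log)
```

In a real setup the loop would run until a convergence criterion on the main task is met, rather than for a fixed number of periods.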
In the description of this specification, reference to terms such as "one embodiment" or "a specific embodiment" means that a particular feature, structure, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the invention; those skilled in the art may make various modifications and variations to the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.
Claims (10)
1. An automatic speech prosody extraction and labeling method, characterized in that the method comprises the following steps:
Step 1: receiving the speech data to be labeled and obtaining the text corresponding to the speech data;
Step 2: aligning the collected speech data with the corresponding text on the time axis using text-to-speech alignment techniques to form an aligned text;
Step 3: segmenting the aligned text into sentences, so as to generate samples in units of sentences;
Step 4: applying an automatic prosodic prominence scaling algorithm to each sentence in the samples, so as to construct a text prosody data set with automatic labels.
2. The automatic speech prosody extraction and labeling method according to claim 1, characterized in that, in step 2, aligning the speech data with its corresponding text on the time axis means that each basic unit in the text is made to correspond to a time span of the speech data, so as to obtain the speech data fragment corresponding to each basic unit in the text, where a basic unit refers to a character or word of Chinese, or a word of English.
3. The automatic speech prosody extraction and labeling method according to claim 1, characterized in that step 4 further comprises: if the raw speech data contains multiple speakers or multiple different reading environments, normalizing the pronunciation habits of the different speakers separately, and applying discretization to the prosodic features of the speech data as needed.
4. An application of the automatic prosody extraction method according to any one of claims 1-3 in natural language processing tasks, characterized in that the method comprises:
Treating the prosody of text data as a sequence labeling task and using a long short-term memory (LSTM) neural network to model the prosodic time series, where the input to the LSTM model is the word-vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the prosodic prominence label of the basic unit at the current position.
5. The application of the automatic prosody extraction method in natural language processing tasks according to claim 4, characterized in that the LSTM model can be extended to bidirectional LSTM networks, multi-layer bidirectional LSTM networks, or temporal recurrent neural networks and their derived types and structures.
6. The application of the automatic prosody extraction method in natural language processing tasks according to claim 5, characterized in that the method further comprises:
Using the text prosody data set for a sentence compression task based on a recurrent neural network (RNN): taking text prosodic prominence labeling as the auxiliary task and sentence compression as the main task, and using the alternating training method of multi-task learning, where in each time period a portion of the text prosody data or of the sentence compression data is fed to the model, in the next time period the other task's data is fed, and the two tasks alternate until the model converges.
7. The application of the automatic prosody extraction method in natural language processing tasks according to claim 5, characterized in that the method further comprises:
Using the text prosody data set for natural language processing tasks based on recurrent neural networks and their related extended or improved structures: taking text prosodic prominence labeling as the auxiliary task and the sentence compression task as the main task, and using the alternating training method of multi-task learning, where in each time period a portion of the text prosody data or of the sentence compression data is fed to the model, in the next time period the other task's data is fed, and the two tasks alternate, optimizing the model parameters until the model converges.
8. An automatic speech prosody extraction and labeling system, characterized in that the system includes:
An acquisition module, which receives the speech data to be labeled and obtains the text corresponding to the speech data;
An alignment module, which aligns the collected speech data with its text on the time axis using text-to-speech alignment techniques to form an aligned text;
A segmentation module, which segments the aligned text into sentences to generate samples in units of sentences;
An automatic prosody labeling module, which applies an automatic prosodic prominence scaling algorithm to each sentence in the samples, thereby constructing a text prosody data set with automatic labels.
9. The automatic speech prosody extraction and labeling system according to claim 8, characterized in that, in the alignment module, aligning the speech data with its corresponding text on the time axis means that each basic unit in the text is made to correspond to a time span of the speech data, so as to obtain the speech data fragment corresponding to each basic unit in the text, where a basic unit refers to a character or word of Chinese, or a word of English.
10. The automatic speech prosody extraction and labeling system according to claim 8, characterized in that the segmentation module is further configured to:
If the raw speech data contains multiple speakers or multiple different reading environments, normalize the pronunciation habits of the different speakers separately, and apply discretization to the prosodic features of the speech data as needed.
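The per-time-step prediction formulated in claim 4 can be illustrated with a minimal single-cell LSTM forward pass; the dimensions, random weights, and word vectors below are all hypothetical stand-ins, and a trained model would replace them:

```python
import numpy as np

rng = np.random.default_rng(0)

def lstm_step(x, h, c, W, U, b):
    # one LSTM cell step: gates computed from input x and previous hidden h
    z = W @ x + U @ h + b
    H = h.size
    i = 1 / (1 + np.exp(-z[:H]))        # input gate
    f = 1 / (1 + np.exp(-z[H:2*H]))     # forget gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))   # output gate
    g = np.tanh(z[3*H:])                # candidate cell state
    c = f * c + i * g
    h = o * np.tanh(c)
    return h, c

D, H, L = 8, 16, 5  # word-vector dim, hidden size, number of prominence levels
W = rng.normal(0, 0.1, (4*H, D))
U = rng.normal(0, 0.1, (4*H, H))
b = np.zeros(4*H)
V = rng.normal(0, 0.1, (L, H))  # per-time-step output projection

sentence = rng.normal(size=(6, D))  # 6 word vectors (hypothetical embeddings)
h, c = np.zeros(H), np.zeros(H)
labels = []
for x in sentence:
    h, c = lstm_step(x, h, c, W, U, b)
    labels.append(int(np.argmax(V @ h)))  # prominence level for this position
print(labels)  # one predicted label per word
```

The bidirectional and multi-layer variants of claim 5 would add a second pass over the reversed sentence and stacked cells, respectively, but the per-position output structure stays the same.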
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710023633.8A CN106683667A (en) | 2017-01-13 | 2017-01-13 | Automatic rhythm extracting method, system and application thereof in natural language processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106683667A true CN106683667A (en) | 2017-05-17 |
Family
ID=58858838
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710023633.8A Pending CN106683667A (en) | 2017-01-13 | 2017-01-13 | Automatic rhythm extracting method, system and application thereof in natural language processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106683667A (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112136141A (en) * | 2018-03-23 | 2020-12-25 | 谷歌有限责任公司 | Robot based on free form natural language input control |
US11972339B2 (en) | 2018-03-23 | 2024-04-30 | Google Llc | Controlling a robot based on free-form natural language input |
CN111989696A (en) * | 2018-04-18 | 2020-11-24 | 渊慧科技有限公司 | Neural network for scalable continuous learning in domains with sequential learning tasks |
US12020164B2 (en) | 2018-04-18 | 2024-06-25 | Deepmind Technologies Limited | Neural networks for scalable continual learning in domains with sequentially learned tasks |
CN108986798A (en) * | 2018-06-27 | 2018-12-11 | 百度在线网络技术(北京)有限公司 | Processing method, device and the equipment of voice data |
WO2020024582A1 (en) * | 2018-07-28 | 2020-02-06 | 华为技术有限公司 | Speech synthesis method and related device |
CN112307236A (en) * | 2019-07-24 | 2021-02-02 | 阿里巴巴集团控股有限公司 | Data labeling method and device |
CN111105785A (en) * | 2019-12-17 | 2020-05-05 | 广州多益网络股份有限公司 | Text prosodic boundary identification method and device |
CN111507104A (en) * | 2020-03-19 | 2020-08-07 | 北京百度网讯科技有限公司 | Method and device for establishing label labeling model, electronic equipment and readable storage medium |
US11531813B2 (en) | 2020-03-19 | 2022-12-20 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method, electronic device and readable storage medium for creating a label marking model |
CN112183086A (en) * | 2020-09-23 | 2021-01-05 | 北京先声智能科技有限公司 | English pronunciation continuous reading mark model based on sense group labeling |
CN117012178A (en) * | 2023-07-31 | 2023-11-07 | 支付宝(杭州)信息技术有限公司 | Prosody annotation data generation method and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106683667A (en) | Automatic rhythm extracting method, system and application thereof in natural language processing | |
Liu et al. | Learning natural language inference using bidirectional LSTM model and inner-attention | |
CN108597541B (en) | Speech emotion recognition method and system for enhancing anger and happiness recognition | |
WO2023024412A1 (en) | Visual question answering method and apparatus based on deep learning model, and medium and device | |
CN108255805A (en) | The analysis of public opinion method and device, storage medium, electronic equipment | |
CN103761975B (en) | Method and device for oral evaluation | |
CN110647612A (en) | Visual conversation generation method based on double-visual attention network | |
CN109979429A (en) | A kind of method and system of TTS | |
CN107220235A (en) | Speech recognition error correction method, device and storage medium based on artificial intelligence | |
CN110489750A (en) | Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF | |
CN107731228A (en) | The text conversion method and device of English voice messaging | |
CN108549658A (en) | A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree | |
CN107679225A (en) | A kind of reply generation method based on keyword | |
CN112579794B (en) | Method and system for predicting semantic tree for Chinese and English word pairs | |
CN104538025A (en) | Method and device for converting gestures to Chinese and Tibetan bilingual voices | |
CN110717341A (en) | Method and device for constructing old-Chinese bilingual corpus with Thai as pivot | |
CN110852089A (en) | Operation and maintenance project management method based on intelligent word segmentation and deep learning | |
Kasai et al. | End-to-end graph-based TAG parsing with neural networks | |
CN114360584A (en) | Phoneme-level-based speech emotion layered recognition method and system | |
CN116029303A (en) | Language expression mode identification method, device, electronic equipment and storage medium | |
CN116483314A (en) | Automatic intelligent activity diagram generation method | |
CN115221284A (en) | Text similarity calculation method and device, electronic equipment and storage medium | |
CN109002540A (en) | It is a kind of Chinese notice document problem answers to automatic generation method | |
CN114943235A (en) | Named entity recognition method based on multi-class language model | |
Sirirattanajakarin et al. | BoydCut: Bidirectional LSTM-CNN Model for Thai Sentence Segmenter |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170517 |