CN106683667A - Automatic rhythm extracting method, system and application thereof in natural language processing - Google Patents


Info

Publication number
CN106683667A
Authority
CN
China
Prior art keywords
rhythm
text
data
sentence
automatic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710023633.8A
Other languages
Chinese (zh)
Inventor
陈彦局
潘嵘
李双印
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ipin Information Technology Co Ltd
Original Assignee
Shenzhen Ipin Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Ipin Information Technology Co Ltd filed Critical Shenzhen Ipin Information Technology Co Ltd
Priority to CN201710023633.8A priority Critical patent/CN106683667A/en
Publication of CN106683667A publication Critical patent/CN106683667A/en
Pending legal-status Critical Current

Classifications

    • G — PHYSICS
    • G10 — MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L — SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 — Speech recognition
    • G10L15/02 — Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/04 — Segmentation; Word boundary detection
    • G10L15/06 — Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 — Training
    • G10L15/08 — Speech classification or search
    • G10L15/14 — Speech classification or search using statistical models, e.g. Hidden Markov Models [HMMs]
    • G10L15/142 — Hidden Markov Models [HMMs]
    • G10L15/148 — Duration modelling in HMMs, e.g. semi HMM, segmental models or transition probabilities
    • G10L15/16 — Speech classification or search using artificial neural networks
    • G10L15/18 — Speech classification or search using natural language modelling
    • G10L15/1807 — Speech classification or search using natural language modelling using prosody or stress
    • G10L15/1815 — Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L25/00 — Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 — Speech or voice analysis techniques characterised by the analysis technique
    • G10L25/30 — Speech or voice analysis techniques using neural networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Probability & Statistics with Applications (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to an automatic prosody extraction method, a system, and their application in natural language processing. The method applies an automatic text-to-speech alignment technique to generate a large-scale prosody data set, models the prosody of a sentence with a recurrent neural network extended with a bidirectional mechanism, and applies the automatically constructed text prosody data to natural language processing tasks based on recurrent neural networks. The method exploits the structural similarity between text prosody sequences and the sequence data common in natural language processing tasks: through alternating training under multi-task learning, a natural language processing task is improved without the aid of manually and explicitly annotated semantic information. In practice the method overcomes the shortcomings of manual prosodic annotation, namely low efficiency, inconsistent standards, and inability to scale, while transferring the semantic and pragmatic information contained in massive speech data to other tasks.

Description

An automatic prosody extraction method, a system, and their application in natural language processing tasks
Technical field
The present invention relates to a speech prosody extraction method, and more particularly to an automatic prosody extraction method and system and their application in natural language processing tasks.
Background technology
Prosody in speech reflects the speaker's intent by assigning different degrees of prominence to different words in a sentence; prosodic prominence is therefore considered indicative for understanding the semantics and pragmatics of speech. The prosody of speech mainly comprises information such as liaison, sense-group pauses, stress, and rising or falling intonation. Besides speech, text is another form that can express semantics and pragmatics, and the prosodic properties it carries can be understood and learned by different readers; that is, text itself contains prosodic properties, which can be learned and predicted. The prosody contained in text can in turn provide semantic and pragmatic guidance to other natural language processing tasks and thereby improve their performance. The prosody implicit in text data cannot be observed directly, so prosody and its corresponding text can only be obtained and labeled from speech data; only then can an algorithm learn to perceive and predict prosody from plain text, providing other natural language processing tasks with guidance beyond supervised syntactic information.
Most current natural language processing frameworks take words and their representations (word vectors) as the basic unit, whereas prosodic properties in speech appear as continuous feature sequences, and speech has no obvious word segmentation points. In addition, accurate word-level prosody extraction based on speech recognition technology cannot produce large-scale, high-quality corpora for training. As a result, most current methods for extracting and using speech prosodic features require people with expert knowledge to manually segment speech fragments, align speech with text, and label word prosodic features, making the generation of supervised data inefficient.
The following related literature exists in the prior art:
1) Brenier, J. M.; Cer, D. M.; and Jurafsky, D. 2005. The detection of emphatic words using acoustic and lexical features. In INTERSPEECH, 3297-3300.
2) Brenier, J. M. 2008. The Automatic Prediction of Prosodic Prominence from Text. ProQuest.
These works provide methods for predicting prosody from plain text, together with corresponding evaluation metrics. The ToBI toolset was used to manually segment speech and its corresponding text and to label prosodic prominence; a text prosody data set was generated by deciding, from the acoustic features of each word, such as pronunciation duration, intensity, and the minima and maxima of the fundamental frequency, whether the word is prominent. A maximum-entropy classifier was used to learn and predict the prosody of text; using text features alone, the classifier reaches a prediction accuracy of about 79%. Neither of the above works applies the generated prosody data set to assist other natural language processing tasks.
Another related reference:
3) Hovy, D.; Anumanchipalli, G. K.; Parlikar, A.; Vaughn, C.; Lammert, A.; Hovy, E.; and Black, A. W. 2013. Analysis and Modeling of "Focus" in Context. In INTERSPEECH, 402-406.
This work provides a method for predicting prosody from context using plain text. Building on the related work, it uses context to assist text prosody prediction and employs crowdsourcing to carry out manual prosody annotation at a certain scale.
All three of the above references, without exception, require word prosodic attributes to be labeled manually, and require speech to be segmented and aligned with text before labeling. This limits the efficiency of data set generation, so large amounts of labeled data cannot be obtained in a short time; the methods in the cited literature therefore lack practicality and cannot be applied in real production. Moreover, the sample sizes of the data sets produced by these methods are insufficient to cover the whole problem space of prosody prediction, so the resulting algorithms generalize poorly and underperform in application.
Therefore, no method is found in the prior art that automatically extracts word-level prosodic features from speech; all existing methods extract them manually. Likewise, no record or practical application is found in the existing literature of using the prosodic features of the text corresponding to speech to assist natural language processing tasks. In this specific category, the present invention provides the first feasible method.
Summary of the invention
The present invention aims to solve at least one of the technical problems in the prior art.
To this end, an object of the present invention is to propose an efficient method for automatic prosody extraction and its application in natural language processing tasks. The method overcomes the defects of traditional manual annotation, which is inefficient, inconsistent in standards, and impossible to apply at scale, while transferring the semantic and pragmatic characteristics present in large amounts of speech data to other tasks. As a data generation scheme that is unsupervised with respect to annotation, the present invention can effectively exploit the prosodic patterns in speech to improve the performance of other natural language processing tasks.
To achieve the above object, the invention provides an automatic speech prosody extraction and labeling method, comprising the following steps:
Step 1: receive the speech data to be labeled and obtain the text corresponding to the speech data;
Step 2: align the collected speech data and the corresponding text on the time axis using a text-to-speech alignment technique, forming an aligned text;
Step 3: segment the aligned text into sentences, thereby generating samples in units of sentences;
Step 4: apply an automatic prosodic prominence labeling algorithm to each sentence in the samples, thereby constructing a text prosody data set with automatic labels, where the prosodic prominence labels of a sentence (or the prosodic labeling of the sentence) refer to the numerical sequence corresponding to the sentence, whose values reflect the prosodic prominence of the different parts (or basic units) of the sentence.
More specifically, aligning the speech data and its corresponding text on the time axis in step 2 means making each basic unit of the text correspond to a time span on the speech data, thereby obtaining the speech fragment corresponding to each basic unit of the text, where a basic unit refers to a character or word in Chinese, or a word in English.
More specifically, step 4 further includes: if the raw speech data contains multiple readers or multiple different reading environments, normalizing the pronunciation habits of the different readers respectively, and discretizing the prosodic features of the speech data.
According to a further aspect of the invention, an application of the automatic prosody extraction method in natural language processing tasks is also provided, the method comprising:
treating the prosody of text data as a sequence labeling task and using a long short-term memory (LSTM) network to model the prosody time series, where the input of the LSTM model is the word vector sequence corresponding to the sentence, and at each time step the model predicts and outputs the prosodic prominence label of the basic unit at the current position.
More specifically, the LSTM model can be extended to bidirectional LSTM networks, multi-layer bidirectional LSTM networks, or other recurrent neural networks and their derived types and structures.
More specifically, the method also includes:
using the text prosody data set in a sentence compression task based on a recurrent neural network (RNN): text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, trained alternately under multi-task learning; in each period the model is fed a batch of text prosody data or sentence compression data, the next period feeds the other task, and the two tasks alternate until the model converges.
More specifically, the method also includes:
using the text prosody data set to assist natural language processing tasks based on recurrent neural networks and their related extended and improved structures: text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, trained alternately under multi-task learning; in each period the model is fed a batch of text prosody data or sentence compression data, the next period feeds the other task, and the two tasks alternate, optimizing the model parameters until the model converges.
According to a further aspect of the invention, an automatic speech prosody extraction and labeling system is also provided, the system comprising:
an acquisition module, which receives the speech data to be labeled and obtains the corresponding text;
an alignment module, which aligns the collected speech data and its text on the time axis using a text-to-speech alignment technique, forming an aligned text;
a segmentation module, which segments the aligned text into sentences, generating samples in units of sentences;
an automatic prosody labeling module, which applies an automatic prosodic prominence labeling algorithm to each sentence in the samples, thereby constructing a text prosody data set with automatic labels, where the prosodic prominence labels of a sentence (or the prosodic labeling of the sentence) refer to the numerical sequence corresponding to the sentence, whose values reflect the prosodic prominence of the different parts (or basic units) of the sentence.
More specifically, aligning the speech data and its corresponding text on the time axis in the alignment module means making each basic unit of the text correspond to a time span on the speech data, thereby obtaining the speech fragment corresponding to each basic unit of the text, where a basic unit refers to a character or word in Chinese, or a word in English.
More specifically, the segmentation module is further configured to:
normalize the pronunciation habits of the different readers respectively if the raw speech data contains multiple readers or multiple different reading environments, and to discretize the prosodic features as needed.
The present invention has the following advantageous technical effects:
1) Automatic text-to-speech alignment is used to generate a large-scale prosody data set. On the basis of a certain alignment strength, the aligned speech fragments can serve as prosody indicators, and through quality control of the prosodic prominence labels a text prosody data set with weakly supervised characteristics can be constructed. Compared with traditional manual annotation, besides being more efficient, the approach is markedly more extensible: prior knowledge can be added at any time to adjust the actual annotation results and performance of the data set; processing is fast and cheap, and a huge amount of data is constructed while saving large amounts of human resources (the volume of data generated in the same time exceeds traditional methods by more than two orders of magnitude).
2) The present invention uses a recurrent neural network to model the prosody of a sentence. After a bidirectional extension mechanism is added, the recurrent neural network can effectively take the context of a word into account; the prediction accuracy for word prosodic prominence labels reaches more than 90%, clearly better than the traditional maximum-entropy method. At the same time, no expert knowledge is needed for feature extraction, which reduces feature engineering while making the process closer to human cognition.
3) The present invention applies the automatically constructed text prosody data set to natural language processing tasks based on recurrent neural networks.
The method takes full advantage of the structural similarity between text prosody sequences and the sequence data of natural language processing tasks; through alternating training under multi-task learning, a natural language processing task is improved without the aid of explicitly labeled semantic information. In the example of the sentence compression task, the method of the present invention achieves a significant performance gain over the prior art (more than 10%).
Additional aspects and advantages of the present invention will be given in the following description; some will become apparent from the description, or be learned through practice of the invention.
Brief description of the drawings
The above and additional aspects and advantages of the present invention will become apparent and easy to understand from the following description of embodiments with reference to the accompanying drawings, in which:
Fig. 1 shows the flow chart of an automatic speech prosody extraction and labeling method according to the invention;
Fig. 2 shows the processing mode of the multi-task LSTM model according to the invention;
Fig. 3 shows the processing mode of the multi-task bidirectional LSTM model according to the invention;
Fig. 4 shows the block diagram of an automatic speech prosody extraction and labeling system according to the invention.
Detailed description of the embodiments
In order that the above objects, features, and advantages of the present invention can be understood more clearly, the present invention is described in further detail below with reference to the accompanying drawings and specific embodiments. It should be noted that, where there is no conflict, the features of the embodiments of the application can be combined with one another.
Many details are set forth in the following description so that the present invention can be fully understood; however, the present invention can also be implemented in ways other than those described here, and the protection scope of the present invention is therefore not limited by the specific embodiments disclosed below.
Fig. 1 shows the flow chart of an automatic speech prosody extraction and labeling method according to the invention.
As shown in Fig. 1, an automatic speech prosody extraction and labeling method according to the invention comprises the following steps:
Step 1: receive the speech data to be labeled and obtain the text corresponding to the speech data.
Step 2: align the collected speech data and the corresponding text on the time axis using a text-to-speech alignment technique, forming an aligned text.
Specifically, each basic unit of the text can be made to correspond to a time span on the speech data, thereby obtaining the speech fragment corresponding to each basic unit of the text, where a basic unit refers to a character or word in Chinese, or a word in English.
In addition, text-to-speech alignment techniques include, but are not limited to, obtaining the time at which the pronunciation of each basic unit in the speech data begins and the time at which it ends, thereby obtaining the time span each basic unit occupies on the speech data and the intervals between basic units.
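As an illustration of how the aligned time spans can be consumed, the sketch below derives per-unit pronunciation durations and inter-unit pauses from hypothetical forced-alignment output of the form (unit, start, end); the tuple format and example values are assumptions for illustration, not the aligner of the invention:

```python
# Sketch: derive per-unit durations and inter-unit pauses from hypothetical
# forced-alignment output of the form (unit, start_seconds, end_seconds).
def duration_features(aligned):
    """Return (durations, pauses): the pronunciation length of each basic
    unit, and the silent gap between each unit and the next one."""
    durations = [end - start for _, start, end in aligned]
    pauses = [aligned[i + 1][1] - aligned[i][2] for i in range(len(aligned) - 1)]
    return durations, pauses

# Three aligned Chinese basic units (characters) with their time spans.
aligned = [("我", 0.00, 0.35), ("爱", 0.40, 0.80), ("你", 0.80, 1.30)]
durs, gaps = duration_features(aligned)
# durs ≈ [0.35, 0.40, 0.50]; gaps ≈ [0.05, 0.0]
```

Duration and pause sequences like these are exactly the kind of per-unit feature the prominence labeling step can consume downstream.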
Step 3: segment the aligned text into sentences, generating samples in units of sentences.
For example, sentences can be, but are not limited to being, segmented from the aligned text according to its punctuation, so that each sentence consists of basic units accompanied by their corresponding speech fragments.
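The punctuation-based segmentation can be sketched as follows; the set of sentence-final marks is an assumed example, and the speech-fragment handles ("f1", ...) are placeholders for the attached audio fragments:

```python
SENT_END = set("。！？.!?")  # assumed sentence-final punctuation marks

def split_sentences(aligned_units):
    """Split a sequence of (unit, speech_fragment) pairs into sentence
    samples; each sample keeps its units with their speech fragments."""
    sentences, current = [], []
    for unit, fragment in aligned_units:
        if unit in SENT_END:
            if current:
                sentences.append(current)
            current = []
        else:
            current.append((unit, fragment))
    if current:  # trailing sentence without final punctuation
        sentences.append(current)
    return sentences

units = [("今", "f1"), ("天", "f2"), ("。", None), ("好", "f3"), ("！", None)]
samples = split_sentences(units)
# → [[("今", "f1"), ("天", "f2")], [("好", "f3")]]
```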
Step 4: apply an automatic prosodic prominence labeling algorithm to each sentence of the segmented text, thereby constructing a text prosody data set with automatic labels.
Specifically, this step also includes: if the raw speech data contains multiple readers or multiple different reading environments, the pronunciation habits of the different readers must each be normalized to eliminate their influence, and the prosodic features of the speech data are discretized as needed. Here the prosodic features refer to the pronunciation duration of a basic unit and the maxima and minima of its intensity and fundamental frequency.
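The per-reader normalization can be sketched as a z-score over each reader's own recordings, so that a habitually slow or loud reader does not bias the prominence labels; the (speaker, value) input format is an assumption for illustration:

```python
from statistics import mean, pstdev

def normalize_per_speaker(samples):
    """samples: list of (speaker_id, feature_value). Z-score each value
    against that speaker's own mean and deviation, so that differing
    reading habits (fast/slow, loud/soft) are factored out."""
    by_speaker = {}
    for spk, val in samples:
        by_speaker.setdefault(spk, []).append(val)
    # fall back to 1.0 when a speaker's values are constant (zero deviation)
    stats = {spk: (mean(vs), pstdev(vs) or 1.0) for spk, vs in by_speaker.items()}
    return [(spk, (val - stats[spk][0]) / stats[spk][1]) for spk, val in samples]

# Two readers with different baseline durations map onto a comparable scale.
norm = normalize_per_speaker([("A", 0.2), ("A", 0.4), ("B", 0.5), ("B", 0.7)])
# values ≈ [-1.0, 1.0, -1.0, 1.0] despite the raw scales differing
```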
When applying the automatic prosodic prominence labeling algorithm to each sentence of the segmented text, some or all of the above three kinds of prosodic features can be selected as its input, where the prosodic prominence labels of a sentence (or the prosodic labeling of the sentence) refer to the numerical sequence corresponding to the sentence, whose values reflect the prosodic prominence of the different parts (or basic units) of the sentence.
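One possible realization of the prominence labeling step, under the assumption (not fixed by the text) that the selected, already-normalized features are averaged into one score per unit and the scores are bucketed into a small number of discrete prominence levels:

```python
def prominence_labels(feature_rows, levels=3):
    """feature_rows: per-unit rows of already-normalized prosodic features,
    e.g. (duration, intensity, f0_range). Average the selected features
    into one score per unit, then discretize the scores into `levels`
    buckets, yielding the numerical prominence sequence of the sentence."""
    scores = [sum(row) / len(row) for row in feature_rows]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0  # guard against a flat sentence
    return [min(int((s - lo) / span * levels), levels - 1) for s in scores]

# A sentence of four units: the third unit carries the strongest cues.
labels = prominence_labels([(0.1, 0.0, 0.2), (0.5, 0.4, 0.6),
                            (1.0, 0.9, 1.1), (0.2, 0.1, 0.3)])
# → [0, 1, 2, 0]
```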
According to the second aspect of the invention, a method of applying the automatic prosody extraction in natural language processing tasks is also provided, the application comprising:
treating the prosody of text data as a sequence labeling task and using a long short-term memory (LSTM) network to model the prosody time series; the input of the LSTM model is the word vector sequence corresponding to the sentence, and at each time step the model predicts and outputs the prosodic prominence label of the basic unit at the current position.
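To make the shape of this computation concrete, the toy forward pass below runs a unidirectional LSTM over a word-vector sequence and emits one prominence label per time step. The weights are random and the sizes illustrative; it demonstrates only the sequence-labeling setup, not a trained model of the invention:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_label_sequence(word_vectors, params, n_labels):
    """One unidirectional LSTM pass over a sentence's word-vector
    sequence, emitting a prominence label at every time step."""
    W, U, b, V = params            # input weights, recurrent weights, bias, readout
    assert V.shape[0] == n_labels
    hidden = W.shape[0] // 4
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    labels = []
    for x in word_vectors:
        z = W @ x + U @ h + b                        # all four gates at once
        i, f, o = (sigmoid(z[k * hidden:(k + 1) * hidden]) for k in range(3))
        g = np.tanh(z[3 * hidden:])
        c = f * c + i * g                            # cell-state update
        h = o * np.tanh(c)                           # hidden state
        labels.append(int(np.argmax(V @ h)))         # per-step prominence label
    return labels

rng = np.random.default_rng(0)
d, hidden, n_labels = 8, 16, 3                       # toy sizes
params = (rng.normal(size=(4 * hidden, d)), rng.normal(size=(4 * hidden, hidden)),
          np.zeros(4 * hidden), rng.normal(size=(n_labels, hidden)))
sentence = [rng.normal(size=d) for _ in range(5)]    # stand-in word vectors
print(lstm_label_sequence(sentence, params, n_labels))  # five labels in {0, 1, 2}
```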
More specifically, the LSTM model can be extended to bidirectional LSTM networks, multi-layer bidirectional LSTM networks, or other recurrent neural networks and their derived types and structures, such as gated recurrent units (GRU).
More specifically, the application also includes:
using the text prosody data set in a sentence compression task based on a recurrent neural network (RNN): text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, trained alternately under multi-task learning; in each period the model is fed a batch of text prosody data or sentence compression data, the next period feeds the other task, and the two tasks alternate until the model converges. Fig. 2 shows the processing mode of the multi-task LSTM model according to the invention: text prosodic prominence labeling, as the auxiliary task, corresponds to the output of the A-series nodes, and sentence compression, as the main task, corresponds to the output of the Y-series nodes. Under the alternating regime, the model is fed in each period a batch of prosodic prominence labeling data or sentence compression data, the next period feeds the other task, and the two tasks alternate until the model converges. Fig. 3 shows the processing mode of the multi-task bidirectional LSTM model according to the invention.
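The alternating schedule itself can be stated compactly in code. The sketch below models only the batch scheduling (which task feeds the shared model in each period); the update and convergence callbacks are stand-ins for the actual RNN training step:

```python
def alternating_training(prosody_batches, compression_batches, update, converged):
    """Alternate between the auxiliary prosody task and the main sentence
    compression task: each period feeds the shared model one batch from
    one task, the next period one batch from the other, until convergence."""
    tasks = [("prosody", iter(prosody_batches)),
             ("compression", iter(compression_batches))]
    schedule = []
    period = 0
    while not converged(period):
        name, batches = tasks[period % 2]       # strict alternation of tasks
        batch = next(batches, None)
        if batch is not None:
            update(name, batch)                 # shared-parameter update step
        schedule.append(name)
        period += 1
    return schedule

# Toy run: four periods, recording which task trained when.
log = []
sched = alternating_training([["a1"], ["a2"]], [["y1"], ["y2"]],
                             update=lambda name, b: log.append((name, b)),
                             converged=lambda p: p >= 4)
# sched → ["prosody", "compression", "prosody", "compression"]
```

Because both tasks update the same shared parameters, the auxiliary prominence labels regularize the representation the main task uses, which is the mechanism by which the prosody data assists the compression task.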
More specifically, the application also includes:
using the text prosody data set in natural language processing tasks based on recurrent neural networks: text prosodic prominence labeling serves as the auxiliary task and sentence compression as the main task, trained alternately under multi-task learning; in each period the model is fed a batch of text prosody data or sentence compression data, the next period feeds the other task, and the two tasks alternate, optimizing the model parameters until the model converges. Here recurrent neural networks include, but are not limited to, LSTM, GRU, and their extensions in depth.
The above can be described formally. Let X be the input text sequence, A the prosodic prominence sequence corresponding to the text sequence, and Y the compression labels corresponding to the text; the three sequences have the form
X = (x_1, ..., x_N),
A = (a_1, ..., a_N),
Y = (y_1, ..., y_N).
The above tasks actually optimize the following problem:
θ* = argmax_θ Σ log p(A | X; θ).
For the LSTM model (above), p can be expressed as
p(A | X; θ) = ∏_{n=1}^{N} p(a_n | x_1, ..., x_n; θ),
and for the bidirectional LSTM model (below) as
p(A | X; θ) = ∏_{n=1}^{N} p(a_n | x_1, ..., x_N; θ).
Using the optimized parameters θ*, the predicted prominence output A of the model is expressed as
A* = argmax_A p(A | X; θ*).
The same holds for the main prediction task Y of the model, for which expressions of the same form are obtained and are not repeated here.
Fig. 4 shows the block diagram of an automatic speech prosody extraction and labeling system according to the invention.
As shown in Fig. 4, the system comprises:
an acquisition module, which receives the speech data to be labeled and obtains the corresponding text;
an alignment module, which aligns the collected speech data and its text on the time axis using a text-to-speech alignment technique, forming an aligned text;
a segmentation module, which segments the aligned text into sentences, generating samples in units of sentences;
an automatic prosody labeling module, which applies an automatic prosodic prominence labeling algorithm to each sentence in the samples, thereby constructing a text prosody data set with automatic labels, where the prosodic prominence labels of a sentence (or the prosodic labeling of the sentence) refer to the numerical sequence corresponding to the sentence, whose values reflect the prosodic prominence of the different parts (or basic units) of the sentence.
More specifically, aligning the speech data and its corresponding text on the time axis in the alignment module means making each basic unit of the text correspond to a time span on the speech data, thereby obtaining the speech fragment corresponding to each basic unit of the text, where a basic unit refers to a character or word in Chinese, or a word in English.
More specifically, the segmentation module is further configured to:
normalize the pronunciation habits of the different readers respectively if the raw speech data contains multiple readers or multiple different reading environments, and to discretize the prosodic features of the speech data as needed.
Through automatic text-to-speech alignment, the present invention aligns speech fragments with the corresponding words in the text and uses the speech fragments as indicators of word prosodic prominence, thereby obtaining a large amount of automatically generated labeled text prosody data and constructing a text prosody data set.
At the same time, exploiting its weakly supervised characteristics, the present invention uses the text prosody data set in a multi-task learning mode: under a recurrent neural network model structure, it is trained alternately with other natural language processing tasks, thereby improving the performance of those tasks.
In the description of this specification, the terms "one embodiment", "specific embodiment", and the like mean that specific features, structures, or characteristics described in connection with the embodiment or example are included in at least one embodiment or example of the present invention. In this specification, schematic uses of these terms do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The above are only preferred embodiments of the present invention and do not limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (10)

1. An automatic speech rhythm extraction and labelling method, characterized in that the method comprises the following steps:
Step 1: receiving the speech data to be labelled and obtaining the text corresponding to the speech data;
Step 2: aligning the collected speech data and the corresponding text on the time axis using text-to-speech alignment, forming an aligned text;
Step 3: segmenting the aligned text into sentences, so as to generate samples in units of sentences;
Step 4: applying an automatic rhythm salience labelling algorithm to each sentence in the samples, thereby constructing and obtaining an automatically labelled text-rhythm dataset.
2. The automatic speech rhythm extraction and labelling method according to claim 1, characterized in that in step 2, aligning the speech data and its corresponding text on the time axis specifically means: making each elementary unit of the text correspond to a time span on the time axis of the speech data, so as to obtain the speech data fragment corresponding to each elementary unit of the text, wherein an elementary unit refers to a character or word of Chinese, or a word of English.
3. The automatic speech rhythm extraction and labelling method according to claim 1, characterized in that step 4 further comprises: if the raw speech data contains multiple readers or multiple different reading environments, normalizing the pronunciation habits of the different readers separately, and discretizing the prosodic features of the speech data as needed.
4. Use of the automatic rhythm extraction method according to any one of claims 1-3 in a natural language processing task, characterized in that the method comprises:
treating the rhythm of the text data as a sequence labelling task and using a long short-term memory (LSTM) artificial neural network to model the rhythm time series, wherein the input of the LSTM model is the word-vector sequence corresponding to a sentence, and at each time step the model predicts and outputs the rhythm salience label of the elementary unit at the current position.
5. The use of the automatic rhythm extraction method in a natural language processing task according to claim 4, characterized in that the LSTM model may be extended to a bidirectional LSTM network, a multi-layer bidirectional LSTM network, or a temporal recurrent neural network and its derived types and structures.
6. The use of the automatic rhythm extraction method in a natural language processing task according to claim 5, characterized in that the method further comprises:
using the text-rhythm dataset for a sentence compression task based on a recurrent neural network (RNN): taking the text rhythm salience labels as the auxiliary task and the sentence compression task as the main task, and applying the alternating training method of multi-task learning, wherein in each time period a portion of the text-rhythm data or of the sentence compression data is fed to the model, in the next time period the other task is fed, and the two tasks alternate until the model converges.
7. The use of the automatic rhythm extraction method in a natural language processing task according to claim 5, characterized in that the method further comprises:
using the text-rhythm dataset for natural language processing tasks based on a recurrent neural network and its related extended and improved structures: taking the rhythm salience labels of the text as the auxiliary task and the sentence compression task as the main task, and applying the alternating training method of multi-task learning, wherein in each time period a portion of the text-rhythm data or of the sentence compression data is fed to the model, in the next time period the other task is fed, and the two tasks alternate, optimizing the model parameters until the model converges.
8. An automatic speech rhythm extraction and labelling system, characterized in that the system comprises:
an acquisition module, which receives the speech data to be labelled and obtains the text corresponding to the speech data;
an alignment module, which aligns the collected speech data and its text on the time axis using text-to-speech alignment, forming an aligned text;
a segmentation module, which segments the aligned text into sentences, generating samples in units of sentences;
an automatic prosody labelling module, which applies an automatic rhythm salience labelling algorithm to each sentence in the samples, thereby constructing and obtaining an automatically labelled text-rhythm dataset.
9. The automatic speech rhythm extraction and labelling system according to claim 8, characterized in that in the alignment module, aligning the speech data and its corresponding text on the time axis specifically means: making each elementary unit of the text correspond to a time span on the time axis of the speech data, so as to obtain the speech data fragment corresponding to each elementary unit, wherein an elementary unit refers to a character or word of Chinese, or a word of English.
10. The automatic speech rhythm extraction and labelling system according to claim 8, characterized in that the segmentation module is further configured to:
if the raw speech data contains multiple readers or multiple different reading environments, normalize the pronunciation habits of the different readers separately, and discretize the prosodic features of the speech data as needed.
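The LSTM sequence tagger of claim 4 can be sketched as follows; this is an illustrative NumPy toy with random, untrained weights that shows only the architecture assumed by the claim: a word-vector sequence goes in, and one rhythm salience label comes out per position:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class TinyLSTMTagger:
    """Minimal LSTM sequence tagger: reads a sentence as a sequence of
    word vectors and emits a rhythm salience label at every time step.
    Weights are random; training (e.g. with cross-entropy against the
    automatically generated labels) is omitted."""
    def __init__(self, d_in, d_hid, n_labels, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.1, (4 * d_hid, d_in + d_hid))  # gate weights
        self.b = np.zeros(4 * d_hid)
        self.W_out = rng.normal(0, 0.1, (n_labels, d_hid))      # per-step softmax head
        self.d_hid = d_hid

    def tag(self, word_vectors):
        h = np.zeros(self.d_hid)
        c = np.zeros(self.d_hid)
        labels = []
        for x in word_vectors:
            z = self.W @ np.concatenate([x, h]) + self.b
            i, f, o, g = np.split(z, 4)             # input, forget, output, candidate
            c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
            h = sigmoid(o) * np.tanh(c)
            labels.append(int(np.argmax(self.W_out @ h)))  # label for current position
        return labels
```

The bidirectional and multi-layer variants of claim 5 extend this cell by running a second pass right-to-left and stacking layers; the per-position output interface stays the same.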
CN201710023633.8A 2017-01-13 2017-01-13 Automatic rhythm extracting method, system and application thereof in natural language processing Pending CN106683667A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710023633.8A CN106683667A (en) 2017-01-13 2017-01-13 Automatic rhythm extracting method, system and application thereof in natural language processing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710023633.8A CN106683667A (en) 2017-01-13 2017-01-13 Automatic rhythm extracting method, system and application thereof in natural language processing

Publications (1)

Publication Number Publication Date
CN106683667A true CN106683667A (en) 2017-05-17

Family

ID=58858838

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710023633.8A Pending CN106683667A (en) 2017-01-13 2017-01-13 Automatic rhythm extracting method, system and application thereof in natural language processing

Country Status (1)

Country Link
CN (1) CN106683667A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108986798A (en) * 2018-06-27 2018-12-11 百度在线网络技术(北京)有限公司 Processing method, device and the equipment of voice data
WO2020024582A1 (en) * 2018-07-28 2020-02-06 华为技术有限公司 Speech synthesis method and related device
CN111105785A (en) * 2019-12-17 2020-05-05 广州多益网络股份有限公司 Text prosodic boundary identification method and device
CN111507104A (en) * 2020-03-19 2020-08-07 北京百度网讯科技有限公司 Method and device for establishing label labeling model, electronic equipment and readable storage medium
CN111989696A (en) * 2018-04-18 2020-11-24 渊慧科技有限公司 Neural network for scalable continuous learning in domains with sequential learning tasks
CN112136141A (en) * 2018-03-23 2020-12-25 谷歌有限责任公司 Robot based on free form natural language input control
CN112183086A (en) * 2020-09-23 2021-01-05 北京先声智能科技有限公司 English pronunciation continuous reading mark model based on sense group labeling
CN112307236A (en) * 2019-07-24 2021-02-02 阿里巴巴集团控股有限公司 Data labeling method and device
CN117012178A (en) * 2023-07-31 2023-11-07 支付宝(杭州)信息技术有限公司 Prosody annotation data generation method and device

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112136141A (en) * 2018-03-23 2020-12-25 谷歌有限责任公司 Robot based on free form natural language input control
US11972339B2 (en) 2018-03-23 2024-04-30 Google Llc Controlling a robot based on free-form natural language input
CN111989696A (en) * 2018-04-18 2020-11-24 渊慧科技有限公司 Neural network for scalable continuous learning in domains with sequential learning tasks
US12020164B2 (en) 2018-04-18 2024-06-25 Deepmind Technologies Limited Neural networks for scalable continual learning in domains with sequentially learned tasks
CN108986798A (en) * 2018-06-27 2018-12-11 百度在线网络技术(北京)有限公司 Processing method, device and the equipment of voice data
WO2020024582A1 (en) * 2018-07-28 2020-02-06 华为技术有限公司 Speech synthesis method and related device
CN112307236A (en) * 2019-07-24 2021-02-02 阿里巴巴集团控股有限公司 Data labeling method and device
CN111105785A (en) * 2019-12-17 2020-05-05 广州多益网络股份有限公司 Text prosodic boundary identification method and device
CN111507104A (en) * 2020-03-19 2020-08-07 北京百度网讯科技有限公司 Method and device for establishing label labeling model, electronic equipment and readable storage medium
US11531813B2 (en) 2020-03-19 2022-12-20 Beijing Baidu Netcom Science And Technology Co., Ltd. Method, electronic device and readable storage medium for creating a label marking model
CN112183086A (en) * 2020-09-23 2021-01-05 北京先声智能科技有限公司 English pronunciation continuous reading mark model based on sense group labeling
CN117012178A (en) * 2023-07-31 2023-11-07 支付宝(杭州)信息技术有限公司 Prosody annotation data generation method and device

Similar Documents

Publication Publication Date Title
CN106683667A (en) Automatic rhythm extracting method, system and application thereof in natural language processing
Liu et al. Learning natural language inference using bidirectional LSTM model and inner-attention
CN108597541B (en) Speech emotion recognition method and system for enhancing anger and happiness recognition
WO2023024412A1 (en) Visual question answering method and apparatus based on deep learning model, and medium and device
CN108255805A (en) The analysis of public opinion method and device, storage medium, electronic equipment
CN103761975B (en) Method and device for oral evaluation
CN110647612A (en) Visual conversation generation method based on double-visual attention network
CN109979429A (en) A kind of method and system of TTS
CN107220235A (en) Speech recognition error correction method, device and storage medium based on artificial intelligence
CN110489750A (en) Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN107731228A (en) The text conversion method and device of English voice messaging
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN107679225A (en) A kind of reply generation method based on keyword
CN112579794B (en) Method and system for predicting semantic tree for Chinese and English word pairs
CN104538025A (en) Method and device for converting gestures to Chinese and Tibetan bilingual voices
CN110717341A (en) Method and device for constructing old-Chinese bilingual corpus with Thai as pivot
CN110852089A (en) Operation and maintenance project management method based on intelligent word segmentation and deep learning
Kasai et al. End-to-end graph-based TAG parsing with neural networks
CN114360584A (en) Phoneme-level-based speech emotion layered recognition method and system
CN116029303A (en) Language expression mode identification method, device, electronic equipment and storage medium
CN116483314A (en) Automatic intelligent activity diagram generation method
CN115221284A (en) Text similarity calculation method and device, electronic equipment and storage medium
CN109002540A (en) It is a kind of Chinese notice document problem answers to automatic generation method
CN114943235A (en) Named entity recognition method based on multi-class language model
Sirirattanajakarin et al. BoydCut: Bidirectional LSTM-CNN Model for Thai Sentence Segmenter

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170517

WD01 Invention patent application deemed withdrawn after publication