CN103035241A

CN103035241A - Model complementary Chinese rhythm interruption recognition system and method

Info

Publication number: CN103035241A
Application number: CN2012105258769A
Authority: CN
Inventors: 刘文举; 倪崇嘉
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2012-12-07
Filing date: 2012-12-07
Publication date: 2013-04-10

Abstract

The invention discloses a model-complementary Chinese prosodic break (rhythm interruption) recognition system and method. The method comprises: a first step of inputting, through a first input module, Chinese speech, the corresponding Chinese text, and the time segmentation boundary of each Chinese character in the speech; a second step of performing word segmentation and part-of-speech tagging on the input text through a word segmentation and part-of-speech tagging module, and calculating the lexical and syntactic features of each Chinese character in the text through a first lexical and syntactic feature calculation module; a third step of performing fundamental frequency (F0) extraction and intensity calculation on the input speech through an F0 extraction and intensity calculation module, from which a first acoustic feature calculation module obtains the acoustic features of each Chinese character; and a fourth step of loading the trained combined complementary model, identifying the prosodic break type of each Chinese character from its acoustic, lexical and syntactic features, and outputting the Chinese text annotated with prosodic break types.

Description

Model-complementary Chinese prosodic break recognition system and method

Technical field

The present invention relates to speech and language information technology, and in particular to a model-complementary Chinese prosodic break recognition system and method.

Background technology

When people communicate, they convey not only the literal content of spoken and written language; the prosodic information carried by speech is also an important part of the message. Prosodic features are therefore also called suprasegmental features. On the one hand, well-organized prosody lets a speaker express the intended information clearly; on the other hand, a correct understanding of prosodic information helps the listener understand what is heard clearly and accurately. Research over recent decades has also shown that introducing prosodic features plays an important role in reducing the error rate of speech recognition, reducing problem complexity, increasing the accuracy of system understanding, and improving the naturalness of speech synthesis. Trainable statistical prosody models have been applied successfully to speech synthesis, where they greatly improve naturalness. In speech recognition, prosody models have been applied successfully to German and French. For Chinese, prosody models are also gradually being applied to speech recognition, but the results are not yet good, particularly for natural speech recognition, and little prosodic information is exploited. Research on statistical prosody models based on large-scale corpora for speech recognition and speech understanding therefore needs to go further. Human speech is organized not only as a hierarchy of units; the relative prominence of those units is equally important. Prosodic breaks divide a long sentence, in the form of prosodic words, prosodic phrases and so on, into shorter units that are easier for people to understand or for machines to process. Prosodic breaks play a very important role in the naturalness and intelligibility of spoken expression. Because prosodic breaks are increasingly important in speech engineering, recognizing them automatically by computer through modeling is receiving more and more attention. Fig. 1 illustrates the hierarchical structure of Chinese prosody and an example of stress annotation. The example shows that the prosodic hierarchy ("|" marks a prosodic break; the more "|" symbols, the higher the break level and the longer the pause) and stress (underlined characters in the figure are stressed) play a very important role in the cadence of speech, its intelligibility, and the recovery of its focus information.

In the prior art, research on prosodic break recognition falls into two classes. The first class models the acoustic features and the lexical and syntactic features separately and then obtains a better model by weighted combination. The second class models all features directly, for example by combining weak classifiers with an ensemble method or by applying some other modeling method directly, to obtain a model that determines the prosodic break type. The weakness of the first class is that, although the final weighting captures some relation between the acoustic features and the lexical and syntactic features, the model does not exploit the deeper connections between them; at the model level, these methods use only the features of the current syllable. The weakness of the second class is that, although training on acoustic, lexical and syntactic features together strengthens the connection between them, contextual features are not well exploited at the model level.

To this end, the present invention proposes a model-complementary Chinese prosodic break recognition system and method that largely overcomes these deficiencies. The proposed system and method make full use of acoustic, lexical and syntactic information: all features are modeled with boosted classification and regression trees (Boosting CART), all features are also modeled with conditional random fields (CRFs), and the two models are combined by weighting. The Boosting CART model reflects the attributes of the current syllable and, at a deeper level, the connections among those attributes, while the CRF model reflects the contextual properties of the syllable. This complementarity at the model level gives the weighted combined complementary model better recognition performance.

Summary of the invention

(1) Technical problem to be solved

The technical problem to be solved by the present invention is to overcome the deficiency of the prior art, in which a single method models only one class of features. The invention proposes a prosodic break recognition method in which different modeling methods model, from different perspectives, all features used for prosodic break recognition, and the resulting models are fused. Test results show that the method improves the accuracy of prosodic break recognition.

(2) Technical solution

The present invention discloses a model-complementary Chinese prosodic break recognition method, comprising a step A of training a combined complementary model and a step B of recognizing Chinese prosodic breaks with the combined complementary model;

Step A: a combined complementary model training module trains a Boosting CART model (boosted classification and regression trees) on the acoustic, lexical and syntactic features of Chinese characters, trains a conditional random field model on the same acoustic, lexical and syntactic features, and combines the trained Boosting CART model and conditional random field model by weighted combination to obtain the trained combined complementary model;

Step B: Chinese prosodic break recognition is performed according to the combined complementary model.

Recognizing Chinese prosodic breaks with the combined complementary model in step B specifically comprises the following steps:

Step B1: a first input module inputs the Chinese speech, the text of the Chinese speech, and the time segmentation boundary information of each Chinese character in the speech;

Step B2: a word segmentation and part-of-speech tagging module performs word segmentation and part-of-speech tagging on the input text, and a first lexical and syntactic feature calculation module uses the segmentation and tagging results to calculate the lexical and syntactic features of each Chinese character in the text;

Step B3: an F0 extraction and intensity calculation module performs fundamental frequency (F0) extraction and intensity calculation on the input Chinese speech, and a first acoustic feature calculation module combines the results with the time segmentation boundary information of each Chinese character to obtain the acoustic features of each Chinese character in the text;

Step B4: a combined complementary model loading module loads the trained combined complementary model, and a recognition module uses the acoustic, lexical and syntactic features of each Chinese character to recognize its prosodic break type with the trained combined complementary model;

Step B5: a prosodic break annotation result storage module stores the prosodic break type of each Chinese character.

The detailed process by which the combined complementary model training module obtains the trained combined complementary model in step A is as follows:

Step A1: a second input module reads, from a corpus with prosodic annotation, the Chinese speech and the prosodic annotation file corresponding to that speech; the prosodic annotation file contains the Chinese speech text, the time segmentation boundary information of each Chinese character in the speech, the stress type of each Chinese character, and the prosodic break type of each Chinese character;

Step A2: a second lexical and syntactic feature calculation module calculates the lexical and syntactic features of each Chinese character in the Chinese speech text of the prosodic annotation file;

Step A3: a second acoustic feature calculation module calculates the acoustic features of each Chinese character in the Chinese speech text of the prosodic annotation file;

Step A4: a conditional random field model training module trains a conditional random field model on the acoustic, lexical and syntactic features of each input Chinese character using the conditional random field method;

Step A5: a Boosting CART model training module trains a Boosting CART model on the acoustic, lexical and syntactic features of each input Chinese character using the boosted classification and regression tree learning method;

Step A6: a weighted combination module combines the conditional random field model and the Boosting CART model by weighting to obtain the final combined complementary model used for Chinese prosodic break recognition.

The invention also discloses a model-complementary Chinese prosodic break recognition system, comprising:

a combined complementary model training module, which trains a Boosting CART model (boosted classification and regression trees) on the acoustic, lexical and syntactic features of Chinese characters, trains a conditional random field model on the same acoustic, lexical and syntactic features, and combines the trained Boosting CART model and conditional random field model by weighted combination to obtain the trained combined complementary model;

a prosodic break recognition module, which performs Chinese prosodic break recognition according to the combined complementary model.

(3) Beneficial effects

The model-complementary Chinese prosodic break recognition method and system proposed by the present invention model all acoustic, lexical and syntactic features with boosted classification and regression trees (Boosting CART), also model all acoustic, lexical and syntactic features with conditional random fields (CRFs), and then combine the resulting Boosting CART model and CRF model by weighting to obtain a combined complementary model with a higher recognition rate. The Boosting CART model reflects all attributes of the current syllable, whereas the CRF model reflects the contextual properties of the syllable, so the combined complementary model produced by the weighted combination can take full advantage of this complementarity between the models. The invention overcomes the deficiencies of previous prosodic break recognition methods and improves the recognition rate of Chinese prosodic breaks.

Brief description of the drawings

Fig. 1 is a schematic diagram of the word segmentation and part-of-speech tagging of a text together with its prosodic hierarchy and stress annotation;

Fig. 2 is a functional block diagram of a model-complementary Chinese prosodic break recognition system according to the present invention;

Fig. 3 is a schematic diagram of the Chinese prosodic break recognition part of the system shown in Fig. 2;

Fig. 4 is a schematic diagram of the combined complementary model training part of the system shown in Fig. 2;

Fig. 5 is a flowchart of a model-complementary Chinese prosodic break recognition method according to the present invention.

Detailed description of the embodiments

To make the purpose, technical solutions and advantages of the present invention clearer, the invention is described in further detail below with reference to specific embodiments and the accompanying drawings.

Fig. 2 is a schematic diagram of a model-complementary Chinese prosodic break recognition system according to the present invention. As shown in Fig. 2, the system comprises a Chinese prosodic break recognition part and a combined complementary model training part. The Chinese prosodic break recognition part comprises a first input module 101, a word segmentation and part-of-speech tagging module 102, an F0 extraction and intensity calculation module 103, a first lexical and syntactic feature calculation module 104, a first acoustic feature calculation module 105, a combined complementary model loading module 106, a recognition module 107, and a prosodic break annotation result storage module 108. The combined complementary model training part comprises a combined complementary model training module 109.

Fig. 3 is a schematic diagram of the Chinese prosodic break recognition part of the system shown in Fig. 2. The Chinese prosodic break recognition part is described in detail below with reference to Fig. 3.

The first input module 101 is used to input the Chinese speech, the Chinese speech text, and the time segmentation boundary information of each Chinese character in the speech. The input comprises Chinese speech in wav format, the corresponding text, and the time segmentation information of each Chinese character in the speech. The time segmentation boundaries of the Chinese characters are obtained by forced alignment of the speech with a speech recognition system. Thus, for each wav file, the time segmentation information of each Chinese character in the file can be obtained and supplied to the input module. If more than one wav file is input, the input is a file containing the list of wav files, and in the corresponding Chinese speech text file the text of each wav file occupies one line.

The word segmentation and part-of-speech tagging module 102 performs word segmentation and part-of-speech tagging on the input Chinese speech text (part-of-speech tagging assigns each word its part of speech, such as noun, verb or auxiliary word) and produces the segmented and tagged sequence of the Chinese text. This module is the basis of text analysis, because Chinese text, unlike English text, has no spaces separating words. The input text is therefore first segmented and tagged, and the result is passed to the first lexical and syntactic feature calculation module 104 (described in detail below) as the basis for subsequent processing. For segmentation, the sentence is first split into indivisible atoms (each atom may be a Chinese character, a punctuation mark, a digit string or a letter string). Coarse segmentation is then performed as follows: using the dictionary, the whole sentence is built into a directed acyclic graph whose direction follows the left-to-right order of the words in the sentence. Each node of the graph is a possible candidate word, each edge represents the adjacency of two consecutive words, and the weight of each edge is the bigram transition probability of the two words. Dynamic programming is then used to find the N shortest paths from the start node (the beginning of the sentence) to the end node (the end of the sentence), which gives the coarse segmentation results. The coarse segmentation results are fed to a hidden Markov model (HMM), and Viterbi decoding yields the final word segmentation and part-of-speech tagging result. Named entity recognition uses a statistical method: for each character in the corpus, the probabilities of its appearing at the beginning of a word, in the middle of a word, at the end of a word and as a single-character word are recorded; a threshold is then set, and an empirical formula determines whether a single character should be merged with the preceding word.
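The coarse segmentation step can be illustrated with a minimal Python sketch. It is a simplification under stated assumptions, not the patented procedure: the function name `segment`, the toy dictionary and the fallback penalty `unknown_cost` are hypothetical; edges are weighted with unigram log-probabilities rather than the bigram transition probabilities described above; and only the single shortest path is returned instead of the N shortest paths.

```python
import math

def segment(sentence, word_logprob, max_word_len=4):
    """Return the lowest-cost dictionary segmentation of `sentence`.

    word_logprob maps each dictionary word to a log-probability
    (hypothetical unigram scores, used here instead of bigram weights).
    """
    n = len(sentence)
    unknown_cost = 20.0  # assumed penalty for a character missing from the dictionary
    best_cost = [math.inf] * (n + 1)  # best_cost[i]: cheapest path covering sentence[:i]
    back = [0] * (n + 1)              # back[i]: start index of the last word on that path
    best_cost[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_word_len), end):
            word = sentence[start:end]
            if word in word_logprob:
                cost = -word_logprob[word]
            elif end - start == 1:
                cost = unknown_cost   # single characters are always allowed as atoms
            else:
                continue              # not a candidate word: no edge in the graph
            if best_cost[start] + cost < best_cost[end]:
                best_cost[end] = best_cost[start] + cost
                back[end] = start
    words, pos = [], n
    while pos > 0:                    # follow back-pointers from the end of the sentence
        words.append(sentence[back[pos]:pos])
        pos = back[pos]
    return words[::-1]

# Toy dictionary with made-up probabilities, for illustration only.
toy_dict = {w: math.log(p) for w, p in {
    "研究": 0.02, "研究生": 0.01, "生命": 0.02, "生": 0.001,
    "命": 0.001, "的": 0.05, "起源": 0.01}.items()}
print(segment("研究生命的起源", toy_dict))  # ['研究', '生命', '的', '起源']
```

Extending the sketch towards the description would mean keeping the N best partial paths per position and scoring each edge with the bigram probability of the adjacent word pair.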

The F0 extraction and intensity calculation module 103 performs fundamental frequency (F0) extraction and intensity calculation on the input Chinese speech. In a preferred embodiment, the module samples the input speech at 16 kHz with 16-bit quantization, applies a Hamming window with a length of 25.6 ms and a frame shift of 10 ms, and calculates the Mel-frequency cepstral coefficient (MFCC) features of each frame. It calculates the F0 and intensity of each frame of the speech, and the results serve as the basis for processing by the first acoustic feature calculation module 105. The module extracts F0 with a robust pitch-tracking algorithm and, to make the F0 contour continuous, interpolates it with piecewise cubic Hermite interpolation. In addition, F0 is normalized to eliminate differences between speakers.
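The framing parameters of the preferred embodiment can be made concrete with a short Python sketch. The description does not give the intensity formula or the form of the F0 normalization, so two assumptions are made here: intensity is taken as the log energy of each Hamming-windowed frame, and F0 is normalized by a z-score of log-F0. The function names are hypothetical, and pitch tracking and Hermite interpolation are left out.

```python
import math

SAMPLE_RATE = 16000
FRAME_LEN = round(0.0256 * SAMPLE_RATE)   # 25.6 ms window -> 410 samples
FRAME_SHIFT = round(0.010 * SAMPLE_RATE)  # 10 ms frame shift -> 160 samples

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_intensity_db(samples):
    """Per-frame log energy in dB (an assumed definition of intensity)."""
    window = hamming(FRAME_LEN)
    out = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = samples[start:start + FRAME_LEN]
        energy = sum((s * w) ** 2 for s, w in zip(frame, window))
        out.append(10.0 * math.log10(energy / FRAME_LEN + 1e-12))  # floor avoids log(0)
    return out

def normalize_f0(f0_hz):
    """Z-score of log-F0 over one speaker's frames (an assumed normalization).

    Expects an already interpolated contour, so every value is positive.
    """
    logs = [math.log(f) for f in f0_hz]
    mean = sum(logs) / len(logs)
    std = math.sqrt(sum((v - mean) ** 2 for v in logs) / len(logs)) or 1.0
    return [(v - mean) / std for v in logs]
```

With these settings, one second of audio (16000 samples) yields 98 full frames.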

The first lexical and syntactic feature calculation module 104 uses the output of the word segmentation and part-of-speech tagging module 102 to calculate the lexical and syntactic features of each Chinese character in the Chinese speech text. These features are calculated from two aspects. The first is the lexical and syntactic features of the character itself, including its tone, its part of speech, its position in the word (beginning, middle or end), whether it lies at a word segmentation boundary, and the probability that it is a prosodic break position. The second is the contextual properties of the character, including the lexical and syntactic features of the characters to its left and right. Given the characteristics of Chinese, the context window mainly consists of the two characters preceding the current character and the one character following it.
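As an illustration of the context window (two preceding characters, one following character), the following Python sketch builds per-character lexical features from a segmented and tagged sentence. The attribute names, the padding symbol and both function names are hypothetical. Tones are supplied by the caller, and the break-probability feature mentioned above is omitted because the description does not say how it is estimated.

```python
def char_attributes(tagged_words, tones):
    """Expand [(word, pos_tag), ...] into one attribute dict per character.

    tones holds one tone number per character, in sentence order.
    """
    chars = []
    for word, pos in tagged_words:
        for k, ch in enumerate(word):
            if len(word) == 1:
                position = "single"
            elif k == 0:
                position = "head"
            elif k == len(word) - 1:
                position = "tail"
            else:
                position = "middle"
            chars.append({
                "char": ch,
                "pos": pos,
                "position": position,
                "is_boundary": k == len(word) - 1,  # last character of a word
                "tone": tones[len(chars)],          # index of the character being added
            })
    return chars

def lexical_features(chars, pad="<PAD>"):
    """Attach the attributes of offsets -2, -1, 0 and +1 to every character."""
    feats = []
    for i in range(len(chars)):
        row = {}
        for offset in (-2, -1, 0, 1):
            j = i + offset
            for name in ("tone", "pos", "position", "is_boundary"):
                row[f"{name}[{offset:+d}]"] = chars[j][name] if 0 <= j < len(chars) else pad
        feats.append(row)
    return feats
```

Each resulting row can serve both as an input vector for a tree-based classifier (after encoding the categorical values) and as the observation at one position of a sequence model.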

The first acoustic feature calculation module 105 combines the output of the F0 extraction and intensity calculation module 103 with the time segmentation boundary information of the Chinese characters supplied by the first input module 101 to calculate the acoustic features of each Chinese character in the text. The acoustic features of each character mainly cover duration, F0 and intensity. The duration features include the duration of the current character, the duration of the silent segment following the current character, and the duration of the F0 discontinuity between the current character and the following character. The F0 and intensity features include the statistical features of the F0 and intensity of the current character, and the statistical features of its F0 and intensity within the context window. The statistical features of the current character are static and include the maximum, minimum, mean, standard deviation, root mean square and range. The context-window features are dynamic statistical features obtained by normalizing the static statistical features of the current character by the maximum and by the range within its context window, respectively; they characterize how the F0 and intensity of the character change in context.
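A minimal Python sketch of these features follows. The static statistics are the six listed above. The dynamic features divide each static statistic by the maximum and by the range of the context window, which is one plausible reading of the normalization; the exact formula is an assumption, as are the function names, the feature keys and the expectation of positive values such as F0 in Hz.

```python
import math

def static_stats(values):
    """Max, min, mean, standard deviation, RMS and range of one character's frames."""
    n = len(values)
    mean = sum(values) / n
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean,
        "std": math.sqrt(sum((v - mean) ** 2 for v in values) / n),
        "rms": math.sqrt(sum(v * v for v in values) / n),
        "range": max(values) - min(values),
    }

def dynamic_stats(per_char_frames, i, left=2, right=1):
    """Static statistics of character i divided by the window max and window range."""
    lo, hi = max(0, i - left), min(len(per_char_frames), i + right + 1)
    window = [v for frames in per_char_frames[lo:hi] for v in frames]
    w_max = max(window) or 1.0                     # guard against division by zero
    w_range = (max(window) - min(window)) or 1.0
    out = {}
    for name, value in static_stats(per_char_frames[i]).items():
        out[f"{name}/ctx_max"] = value / w_max
        out[f"{name}/ctx_range"] = value / w_range
    return out

def duration_features(boundaries, i):
    """boundaries: [(start_sec, end_sec), ...] per character, from forced alignment."""
    start, end = boundaries[i]
    next_start = boundaries[i + 1][0] if i + 1 < len(boundaries) else end
    return {"duration": end - start, "silence_after": next_start - end}
```

The same two statistics functions apply unchanged to per-frame intensity values.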

The combined complementary model loading module 106 loads the combined complementary model, which has been trained in advance by the combined complementary model training part.

The recognition module 107 uses the combined complementary model loaded by the loading module 106 and the features of the input Chinese characters, comprising acoustic, lexical and syntactic features, to recognize the prosodic break type of each character (that is, whether a prosodic break follows the character). The result serves as input to the prosodic break annotation result storage module 108.

The prosodic break annotation result storage module 108 writes the prosodic break annotation results of the Chinese text to a storage medium.

The combined complementary model training part is what distinguishes the present invention from other prosodic break recognition methods and systems, and it embodies the central idea of model complementarity. Fig. 4 is a schematic diagram of the combined complementary model training part of the system shown in Fig. 2. The training part is described in detail below with reference to Fig. 4.

The combined complementary model training part comprises the combined complementary model training module 109. Using ensemble learning and conditional random field modeling in a supervised machine learning setting, this module trains a Boosting CART model and a conditional random field model on the calculated features of the Chinese characters, comprising acoustic, lexical and syntactic features. It then combines the trained Boosting CART model and conditional random field model by weighted combination to obtain the final combined complementary model used for Chinese prosodic break recognition. In the weighted combination, the weights of the Boosting CART model and the conditional random field model are tuned according to the recognition rate of the combined complementary model on a development set. In a preferred embodiment of the present invention, the following parameter settings are used for training: when weak CART classifiers are combined by boosting, 100 weak CART classifiers are combined, and 15-fold cross-validation is used to improve the accuracy of the Boosting CART model; the conditional random field model is trained by gradient descent.
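The weight tuning on the development set can be sketched in Python as follows. The description does not state the form of the weighted combination, so the sketch assumes a linear interpolation of the two models' posterior probabilities with a single weight `w`, chosen by grid search. The function names, the labels "B" and "NB" (break and no break) and the data layout are hypothetical.

```python
def combine(p_cart, p_crf, w):
    """Assumed combination: w * P_cart + (1 - w) * P_crf for every label."""
    return {label: w * p_cart[label] + (1.0 - w) * p_crf[label] for label in p_cart}

def predict(p_cart, p_crf, w):
    """Label with the highest combined score."""
    scores = combine(p_cart, p_crf, w)
    return max(scores, key=scores.get)

def tune_weight(dev_items, steps=10):
    """Grid-search w in [0, 1] for the best recognition rate on the development set.

    dev_items: list of (p_cart, p_crf, gold_label) triples, one per character.
    Returns (best_w, best_accuracy); the smallest best-scoring w wins ties.
    """
    best_w, best_acc = 0.0, -1.0
    for s in range(steps + 1):
        w = s / steps
        correct = sum(predict(pc, pr, w) == gold for pc, pr, gold in dev_items)
        acc = correct / len(dev_items)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```

A finer grid or separate weights per break type would be natural variations, though neither is specified in the description.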

As shown in Fig. 4, the combined complementary model training module 109 comprises: a second input module 201, a second lexical and syntactic feature calculation module 202, a second acoustic feature calculation module 203, a conditional random field model training module 204, a Boosting CART model training module 205, a weighted combination module 206, and a combined complementary model storage module 207. Specifically:

The second input module 201 reads, from a corpus with prosodic annotation, the Chinese speech and the prosodic annotation file corresponding to that speech. The prosodic annotation file contains the Chinese speech text, the time segmentation information of each Chinese character in the speech, the stress type of each Chinese character, and the prosodic break type of each Chinese character. In a preferred embodiment, the prosodically annotated corpus used by the present invention contains speech from 10 speakers (5 male, 5 female) and 18 text passages comprising 87,586 Chinese characters (including repetitions).

The second lexical and syntactic feature calculation module 202 calculates the lexical and syntactic features of each Chinese character in the Chinese speech text. The Chinese speech text in the prosodic annotation file is segmented and tagged to obtain its word segmentation and part-of-speech information (the part-of-speech information is obtained with the part-of-speech tagging module described above; parts of speech include noun, verb, auxiliary word and so on). Together with the stress type in the prosodic annotation, this is used to calculate the lexical and syntactic features of each character, including its tone, its part of speech, its position in the word (beginning, middle or end), whether it lies at a word segmentation boundary, and the probability that it is a prosodic break position. The contextual properties of the character are also considered, including the lexical and syntactic features of the preceding and following characters. Given the characteristics of Chinese, the context window mainly consists of the two characters preceding the current character and the one character following it.

The second acoustic feature calculation module 203 calculates the acoustic features of each Chinese character in the text. F0 extraction and intensity calculation are performed on the input Chinese speech. In a preferred embodiment, the speech is sampled at 16 kHz with 16-bit quantization, and the F0 and intensity of each frame are calculated; the results serve as the basis for subsequent processing by the second acoustic feature calculation module 203. F0 is extracted with a robust pitch-tracking algorithm and, to make the F0 contour continuous, the contour is interpolated with piecewise cubic Hermite interpolation. In addition, F0 is normalized to eliminate differences between speakers.

The conditional random field model training module 204 is connected to the second acoustic feature calculation module 203 and the second lexical and syntactic feature calculation module 202, and trains the conditional random field model using the conditional random field method. The conditional random field (CRF) model is well suited to labeling sequence data: it provides a powerful means of describing contextual information and is readily trainable. Sequence data reflect the state or degree of a thing or phenomenon as it changes over time.

If X=x ₁... x _TSequence data, Y=y ₁... y _TClass label corresponding to sequence data X (or status switch), Λ={ λ ₁... λ _kBe the parameter of the conditional random field models of linear chain structure, then conditional random field models has defined a conditional probability P _Λ(Y|X) be:

P_{Λ} (Y | X) = \frac{1}{Z_{X}} \exp (Σ_{t = 1}^{T} \underset{k}{Σ} λ_{k} f_{k} (y_{t - 1}, y_{t}, x, t)) - - - (2)

where Z_X is the normalization factor for the input X, which ensures that the probabilities of all state sequences sum to 1; f_k(y_{t-1}, y_t, x, t) is a feature function whose output may be any real number but which in practice is usually binary, taking the value 0 or 1; λ_k is the weight of the feature f_k, a parameter of the CRF model that is learned during training; k is the feature index; and t = 1, 2, ..., T, where T is the length of the sequence data X. Here y_{t-1} is the class label of X at time t-1, and Y is the class label sequence corresponding to X.

The feature function f_k(y_{t-1}, y_t, x, t) measures any aspect of the state transition y_{t-1} → y_t together with the whole observation sequence X centered at time t; it is a unified representation of state feature functions and transition feature functions. Its value is usually binary, either 0 or 1. To define the feature functions, a set of real-valued features of the observation sequence is first constructed to describe the empirical distribution of the training data; each feature function is then expressed in terms of one element of this set. A feature function takes the value of that element when the previous state and the current state have specific values, so all feature functions are real-valued.
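To make formula (2) concrete, the sketch below evaluates P_Λ(Y|X) for a very short sequence by enumerating every label sequence to obtain Z_X. It is only an illustration: the two labels, the two binary feature functions, and the `word_final` attribute are hypothetical, and brute-force enumeration is feasible only for toy inputs.

```python
import itertools
import math

LABELS = ["B0", "B1"]  # hypothetical labels: no break / break

def features(y_prev, y_cur, x, t):
    """Binary feature functions f_k(y_{t-1}, y_t, x, t) (hypothetical)."""
    return [
        # State feature: a break on a word-final character.
        1.0 if y_cur == "B1" and x[t]["word_final"] else 0.0,
        # Transition feature: two breaks in a row.
        1.0 if y_prev == "B1" and y_cur == "B1" else 0.0,
    ]

def score(y, x, weights):
    """sum_t sum_k lambda_k * f_k(y_{t-1}, y_t, x, t); y_{-1} is None."""
    total = 0.0
    for t in range(len(x)):
        y_prev = y[t - 1] if t > 0 else None
        fs = features(y_prev, y[t], x, t)
        total += sum(w * f for w, f in zip(weights, fs))
    return total

def conditional_probability(y, x, weights):
    """P(Y|X) = exp(score) / Z_X, with Z_X summed over all label sequences."""
    z = sum(math.exp(score(c, x, weights))
            for c in itertools.product(LABELS, repeat=len(x)))
    return math.exp(score(tuple(y), x, weights)) / z

x = [{"word_final": False}, {"word_final": True}, {"word_final": False}]
weights = [2.0, -1.0]
p = conditional_probability(("B0", "B1", "B0"), x, weights)
```

Because Z_X sums the exponentiated scores of all label sequences, the probabilities of all sequences add up to 1, as stated above.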

Once the parameters Λ = {λ_1, ..., λ_k} of the conditional random field model are determined, the most probable class label sequence Y* for the sequence data X is

Y^{*} = \underset{Y}{\arg\max}\; P_{\Lambda}(Y \mid X) \qquad (3)

An improved Viterbi algorithm and the A* algorithm are used to obtain the N highest-scoring class label sequences (the N-best list).
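As a point of reference for formula (3), the sketch below shows plain Viterbi decoding for the single best label sequence of a linear-chain model. It is not the improved Viterbi or A* N-best search mentioned above. The `local_score` callback stands for the weighted feature sum at one position, and `demo_score` with its weights is a hypothetical example; the input sequence is assumed to be non-empty.

```python
def viterbi(x, labels, local_score):
    """Return (best label sequence, its score) for a linear-chain model.

    local_score(y_prev, y_cur, x, t) should return
    sum_k lambda_k * f_k(y_prev, y_cur, x, t); y_prev is None at t = 0.
    """
    best = {y: local_score(None, y, x, 0) for y in labels}
    back = []  # back[t-1][y] = best previous label for y at time t
    for t in range(1, len(x)):
        new_best, pointers = {}, {}
        for y in labels:
            prev = max(labels, key=lambda p: best[p] + local_score(p, y, x, t))
            new_best[y] = best[prev] + local_score(prev, y, x, t)
            pointers[y] = prev
        best = new_best
        back.append(pointers)
    last = max(labels, key=lambda y: best[y])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]

def demo_score(y_prev, y_cur, x, t):
    """Hypothetical weights: x[t] is True for a word-final character."""
    s = 0.0
    if y_cur == "B1":
        s += 2.0 if x[t] else -1.0
    if y_prev == "B1" and y_cur == "B1":
        s -= 1.5
    return s

path, best_score = viterbi([False, True, True, False], ["B0", "B1"], demo_score)
```

Since Z_X does not depend on Y, maximizing the unnormalized score is equivalent to maximizing P_Λ(Y|X), so the decoder never needs to compute the normalization factor.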

The parameters Λ = {λ_1, ..., λ_k} of the conditional random field model can be obtained by maximum likelihood estimation. The log-likelihood function for the training set {(x_i, y_i): i = 1, ..., M} can be written as

L_{\Lambda} = \sum_{i} \log P_{\Lambda}(y_{i} \mid x_{i}) = \sum_{i} \left( \sum_{t=1}^{T} \sum_{k} \lambda_{k} f_{k}(y_{t-1}, y_{t}, x_{i}, t) - \log Z_{x_{i}} \right) \qquad (4)

where L_Λ is the log-likelihood function of the training set and represents its log-likelihood score; x_i is the i-th observation sequence, that is, the lexical, grammatical, and acoustic features; y_i is the class label sequence of x_i, with y_{t-1} and y_t its labels at times t-1 and t; λ_k is the weight of the feature function f_k and is the parameter to be trained; Z_{x_i} is the normalization factor for each input, which ensures that the probabilities of all state sequences sum to 1; M is the size of the training set; and T is the length of the sequence data. The conditional random field model is trained by gradient descent, which yields the model P_Λ(Y|X). Given an input feature sequence X = x_1 ... x_T, the Viterbi decoding algorithm can then be used to obtain the most probable class label sequence Y* for X, or the N highest-scoring class label sequences.
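For reference, the gradient on which such training relies follows from differentiating the log-likelihood (4) with respect to each weight. This is the standard result for linear-chain CRFs rather than something stated in the source, and the step size η is an assumed symbol.

```latex
\frac{\partial L_{\Lambda}}{\partial \lambda_{k}}
  = \sum_{i} \left( \sum_{t=1}^{T} f_{k}(y_{t-1}, y_{t}, x_{i}, t)
  - \sum_{Y'} P_{\Lambda}(Y' \mid x_{i}) \sum_{t=1}^{T} f_{k}(y'_{t-1}, y'_{t}, x_{i}, t) \right),
\qquad
\lambda_{k} \leftarrow \lambda_{k} + \eta \, \frac{\partial L_{\Lambda}}{\partial \lambda_{k}}
```

The first term is the count of feature f_k observed in the training labels and the second is its expected count under the current model; the update moves each weight until the two agree, which is equivalent to gradient descent on the negative log-likelihood.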

The Boosting CART (ensemble classification and regression tree) model training module 205 is connected to the second acoustic feature calculation module 203 and to the second lexical and grammatical feature calculation module 202, and uses the Boosting CART learning method to train a Boosting CART model on the acoustic, lexical, and grammatical features of the input Chinese characters.

Boosting is an ensemble machine learning approach, popular in recent years, for improving the accuracy of any given learning algorithm. AdaBoost is the most representative algorithm of the Boosting family, and the various Boosting algorithms that appeared later were developed on the basis of AdaBoost. A preferred embodiment of the present invention adopts the AdaBoost.M2 algorithm, with classification and regression trees (CART) as the weak classifiers, to train the prosodic break recognition model. The Boosting CART method captures the attributes of the current syllable well, reflects the deeper relationships among those attributes, and trains well. Boosting CART is therefore applied to all the features of each Chinese character, namely the acoustic, lexical, and grammatical features, to train the prosodic break model.

The weighted combination module 206 is connected to the Boosting CART model training module 205 and the conditional random field model training module 204, and uses weighted combination to obtain the final combined complementary model for Chinese prosodic break recognition. During the weighted combination, the weights of the Boosting CART model and the CRF model are adjusted according to the recognition rate of the combined complementary model on the development set. Formula (5) is used to combine the Boosting CART model and the CRF model by weighting, yielding the final combined complementary model.
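The boosting idea described above can be illustrated with a deliberately simplified sketch: binary AdaBoost with one-feature threshold rules (decision stumps) as weak classifiers. This is a stand-in chosen for brevity, not the AdaBoost.M2 algorithm with CART weak classifiers used in the preferred embodiment; the labels +1 (break) and -1 (no break), the toy data, and all function names are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Pick the single-feature threshold rule with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] >= thr else -sign for row in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] >= thr else -sign

def adaboost(X, y, rounds=10):
    """Binary AdaBoost: reweight the samples so that each new weak
    classifier concentrates on the samples misclassified so far."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        if err >= 0.5:  # the weak classifier is no better than chance
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        w = [wi * math.exp(-alpha * t * stump_predict(stump, row))
             for wi, row, t in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Sign of the alpha-weighted vote of the weak classifiers."""
    s = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if s >= 0 else -1

X = [[1], [2], [3], [4], [5], [6]]  # one hypothetical feature per sample
y = [1, 1, -1, -1, 1, 1]            # not separable by any single stump
model = adaboost(X, y, rounds=3)
```

No single threshold separates this toy data, yet the weighted vote of three stumps does, which is the sense in which boosting raises the accuracy of a weak learner.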

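Let W = {w_1, w_2, ..., w_n} be the sequence of Chinese characters in a sentence, A = {a_1, a_2, ..., a_n} the corresponding acoustic feature sequence, and S = {s_1, s_2, ..., s_n} the corresponding lexical and grammatical feature sequence. The most probable prosodic break sequence P* of W can then be expressed as:

P^{*} = \underset{P}{\arg\max}\; p(P \mid A, S)

= \underset{P}{\arg\max}\; \left( \lambda \cdot p(P \mid A, S) + (1 - \lambda) \cdot p(P \mid A, S) \right)

= \underset{P}{\arg\max}\; \left( \lambda \cdot p_{1}(P \mid A, S) + (1 - \lambda) \cdot p_{2}(P \mid A, S) \right) \qquad (5)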

where λ is a weight between 0 and 1; P is a prosodic break sequence; P* is the most probable prosodic break sequence; p(P|A, S) is the probability of the prosodic break sequence P given the input acoustic feature sequence A and the lexical and grammatical feature sequence S; and p_1(P|A, S) and p_2(P|A, S) are the probabilities of P obtained by modeling with different methods or with the same method. In a preferred embodiment of the present invention, p_1(P|A, S) is obtained by Boosting CART modeling and p_2(P|A, S) by conditional random field modeling.
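The weighted combination in formula (5), together with the choice of λ on a development set, might be sketched as below. Several simplifications are assumptions: the combination is applied to per-character label posteriors rather than to whole sequences, λ is chosen by a simple grid search that maximizes accuracy, and the function names and probability values are hypothetical.

```python
def combine(p1, p2, lam):
    """lam * p1 + (1 - lam) * p2 for every label of one character."""
    return {label: lam * p1[label] + (1 - lam) * p2[label] for label in p1}

def predict_labels(post1, post2, lam):
    """Pick, for each character, the label with the highest combined score."""
    labels = []
    for p1, p2 in zip(post1, post2):
        mixed = combine(p1, p2, lam)
        labels.append(max(mixed, key=mixed.get))
    return labels

def tune_lambda(post1, post2, gold, step=0.1):
    """Grid search over lam in [0, 1]; keep the first value that gives the
    highest accuracy on the development set."""
    best_lam, best_acc = 0.0, -1.0
    for i in range(int(round(1 / step)) + 1):
        lam = i * step
        pred = predict_labels(post1, post2, lam)
        acc = sum(p == g for p, g in zip(pred, gold)) / len(gold)
        if acc > best_acc:
            best_lam, best_acc = lam, acc
    return best_lam, best_acc

# Hypothetical development-set posteriors for two characters.
cart_post = [{"B0": 0.9, "B1": 0.1}, {"B0": 0.4, "B1": 0.6}]  # p_1
crf_post = [{"B0": 0.3, "B1": 0.7}, {"B0": 0.2, "B1": 0.8}]   # p_2
lam, acc = tune_lambda(cart_post, crf_post, gold=["B0", "B1"])
```

Because both inputs are probability distributions and λ lies between 0 and 1, the combined scores for each character still sum to 1.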

The Boosting CART modeling method captures the acoustic, lexical, and grammatical features of the current syllable well, while the conditional random field modeling method captures the contextual properties of the current syllable well at the model level. In this sense the two modeling methods are complementary. Because the Boosting CART model and the conditional random field model complement each other, the combined complementary model obtained by weighted combination can compensate for the shortcomings of each. The combined complementary model reflects the attributes of the current syllable and the deeper relationships among them, and also captures the contextual properties of the current syllable; it therefore achieves good recognition performance.

The combined complementary model storage module 207 is connected to the weighted combination module 206 and stores the trained combined complementary model.

The system can be implemented on a computer, a server, or a computer network; its input devices preferably include a keyboard, a mouse, a microphone, a communication interface, and the like.

The present invention also provides a model-complementary Chinese prosodic break recognition method, comprising a step of recognizing Chinese prosodic breaks using a combined complementary model and a step of training the combined complementary model, wherein:

The step of recognizing Chinese prosodic breaks using the combined complementary model comprises:

Step 1: input the Chinese speech, the text of the Chinese speech, and the time segmentation boundary information of each Chinese character in the Chinese speech;

Step 2: perform word segmentation and part-of-speech tagging on the input Chinese text, and calculate the lexical and grammatical features of each Chinese character in the text from the results of word segmentation and part-of-speech tagging;

Step 3: perform fundamental frequency extraction and intensity calculation on the input Chinese speech and, using the time segmentation boundary information of each Chinese character in the Chinese speech, obtain the acoustic features of each Chinese character in the text;

Step 4: load the combined complementary model and use it, together with the features of each input Chinese character (acoustic, lexical, and grammatical), to recognize the prosodic break type of each Chinese character;

Step 5: output the prosodic break type of each Chinese character.
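Step 3 can be illustrated by a sketch that computes a frame-level intensity and then summarizes it over each character's time boundaries. The 25.6 ms Hamming window and 10 ms frame shift are the values given in claim 7; everything else is an assumption made for illustration, including the log-energy definition of intensity, the per-character statistics (duration, mean, maximum), and the function names. Fundamental frequency extraction and the Mel cepstrum computation are omitted.

```python
import math

SAMPLE_RATE = 16000                       # 16 kHz sampling
FRAME_LEN = round(0.0256 * SAMPLE_RATE)   # 25.6 ms = 409.6, rounded to 410 samples
FRAME_SHIFT = round(0.010 * SAMPLE_RATE)  # 10 ms = 160 samples

def hamming(n):
    """Hamming window of length n."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def frame_intensity_db(samples):
    """Log energy (dB) of each Hamming-windowed frame (assumed definition)."""
    window = hamming(FRAME_LEN)
    values = []
    for start in range(0, len(samples) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = samples[start:start + FRAME_LEN]
        energy = sum((s * w) ** 2 for s, w in zip(frame, window)) / FRAME_LEN
        values.append(10 * math.log10(energy + 1e-12))
    return values

def character_features(frame_values, boundaries):
    """Summarize frame-level values over each character.

    boundaries: list of (start_sec, end_sec), one pair per character.
    The statistics returned here are illustrative choices only.
    """
    feats = []
    for start, end in boundaries:
        lo = int(start * SAMPLE_RATE / FRAME_SHIFT)
        hi = max(lo + 1, int(end * SAMPLE_RATE / FRAME_SHIFT))
        seg = frame_values[lo:hi]
        if not seg:
            raise ValueError("character boundary lies outside the signal")
        feats.append({"duration": end - start,
                      "mean": sum(seg) / len(seg),
                      "max": max(seg)})
    return feats

# Half a second of silence followed by half a second of a 200 Hz tone.
signal = [0.0] * 8000 + [0.5 * math.sin(2 * math.pi * 200 * n / SAMPLE_RATE)
                         for n in range(8000)]
frames = frame_intensity_db(signal)
feats = character_features(frames, [(0.0, 0.4), (0.6, 0.9)])
```

The same aggregation can be applied to a frame-level F0 contour, so that every character ends up with a fixed-length acoustic feature vector regardless of its duration.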

The step of training the combined complementary model comprises using the combined complementary model training module to train a Boosting CART model on the acoustic, lexical, and grammatical features of the Chinese characters using the Boosting CART method, training a conditional random field model on the same features using the conditional random field (Conditional Random Fields, CRFs) method, and finally combining the trained Boosting CART model and conditional random field model by weighted combination to obtain the combined complementary model. Performing word segmentation and part-of-speech tagging on the input Chinese speech text comprises obtaining the tone, word segmentation, and part-of-speech information of the Chinese characters in the text, and calculating the lexical and grammatical features of each Chinese character in the text from this information. During the weighted combination, the weights of the Boosting CART model and the conditional random field model are adjusted according to the recognition rate of the combined complementary model on the development set.

The step of training the combined complementary model proceeds as follows:

Step 1: read in the Chinese speech and the corresponding prosodic annotation file from a corpus with prosodic annotation. The prosodic annotation file records the Chinese speech text, the time segmentation boundary information of each Chinese character in the Chinese speech, the stress type of each Chinese character, and the prosodic break type of each Chinese character.

Step 2: calculate the lexical and grammatical features of each Chinese character in the text. Word segmentation and part-of-speech tagging are performed on the Chinese text in the prosodic annotation file to obtain its word segmentation and part-of-speech information. Combined with the tone information in the prosodic annotation, this is used to calculate the lexical and grammatical features of each Chinese character, including its tone, its part of speech, its position in the word (word-initial, word-medial, or word-final), whether it is a word segmentation boundary, and the probability that it is a prosodic break. The context of the character is also taken into account, including the lexical and grammatical features of the preceding and following characters. Given the characteristics of Chinese, a preferred embodiment of the present invention mainly uses the two characters before the current character and the one character after it as the context window.

Step 3: calculate the acoustic features of each Chinese character in the text. Fundamental frequency extraction and intensity calculation are performed on the input Chinese speech.

Step 4: train a conditional random field model on the input acoustic, lexical, and grammatical features using the conditional random field method.

Step 5: train a Boosting CART model on the input acoustic, lexical, and grammatical features using the Boosting CART learning method.

Step 6: on the development set, adjust the weights of the input conditional random field model and Boosting CART model to obtain the combined complementary model.
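The context window described in Step 2 (two characters before and one character after the current character) can be sketched as a simple feature-expansion routine. The attribute names, the padding value, and the example tags are hypothetical; the source does not specify how positions outside the sentence are handled, so padding is an assumption.

```python
# Placeholder attributes for positions outside the sentence (assumption).
PAD = {"tone": "<pad>", "pos": "<pad>", "position": "<pad>", "word_boundary": False}

def windowed_features(chars, left=2, right=1):
    """Expand per-character attributes with those of neighbouring characters.

    chars: one dict of lexical/grammatical attributes per character.
    Returns one dict per character with keys such as "tone[-2]" or "pos[+1]".
    """
    rows = []
    for i in range(len(chars)):
        row = {}
        for offset in range(-left, right + 1):
            j = i + offset
            source = chars[j] if 0 <= j < len(chars) else PAD
            for name, value in source.items():
                row[f"{name}[{offset:+d}]"] = value
        rows.append(row)
    return rows

# Hypothetical attributes for the four characters of "我们学习".
sentence = [
    {"tone": 3, "pos": "r", "position": "initial", "word_boundary": False},
    {"tone": 5, "pos": "r", "position": "final", "word_boundary": True},
    {"tone": 2, "pos": "v", "position": "initial", "word_boundary": False},
    {"tone": 2, "pos": "v", "position": "final", "word_boundary": True},
]
rows = windowed_features(sentence)
```

Each character then carries the same number of features, which suits a classifier such as Boosting CART that treats every character as an independent sample.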

The specific embodiments described above further explain the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above are only specific embodiments of the present invention and do not limit it; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. A model-complementary Chinese prosodic break recognition method, comprising a step A of training a combined complementary model and a step B of recognizing Chinese prosodic breaks using the combined complementary model;

2. The method of claim 1, characterized in that recognizing Chinese prosodic breaks using the combined complementary model in step B specifically comprises the following steps:

3. The method of claim 1, characterized in that the combined complementary model training module in step A obtains the trained combined complementary model as follows:

Step A6: the weighted combination module combines the conditional random field model and the Boosting CART model by weighting to obtain the final combined complementary model for Chinese prosodic break recognition.

4. The method of claim 3, characterized in that step A6 is specifically implemented as follows:

Let W = {w_1, w_2, ..., w_n} be the sequence of Chinese characters in a sentence, A = {a_1, a_2, ..., a_n} the corresponding acoustic feature sequence, and S = {s_1, s_2, ..., s_n} the corresponding lexical and grammatical feature sequence. The most probable prosodic break label sequence P* of W can then be expressed as:

P^{*} = \underset{P}{\arg\max}\; p(P \mid A, S)

= \underset{P}{\arg\max}\; \left( \lambda \cdot p(P \mid A, S) + (1 - \lambda) \cdot p(P \mid A, S) \right)

= \underset{P}{\arg\max}\; \left( \lambda \cdot p_{1}(P \mid A, S) + (1 - \lambda) \cdot p_{2}(P \mid A, S) \right) \qquad (1)

where λ is a weight between 0 and 1; P is a prosodic break sequence; P* is the most probable prosodic break sequence; p(P|A, S) is the probability of the prosodic break sequence P given the input acoustic feature sequence A and the lexical and grammatical feature sequence S; p_1(P|A, S) is the probability of P obtained by Boosting CART modeling; and p_2(P|A, S) is the probability of P obtained by conditional random field modeling.

5. The method of claim 4, characterized in that the weight λ is adjusted according to the recognition rate of the combined complementary model on the development set.

6. The method according to claim 2, characterized in that, in step B2, the word segmentation and part-of-speech tagging module performs word segmentation and part-of-speech tagging on the input Chinese speech text in order to obtain the tone, word segmentation, and part-of-speech information of the Chinese characters in the text.

7. The method according to claim 2, characterized in that, in step B3, the fundamental frequency extraction and intensity calculation module performs fundamental frequency extraction and intensity calculation on the input Chinese speech, which specifically comprises: sampling the input Chinese speech at 16 kHz and quantizing it at 16 bits; calculating the Mel cepstrum features of each frame using a Hamming window with a window length of 25.6 milliseconds and a frame shift of 10 milliseconds; and then calculating the fundamental frequency and intensity of each frame of the Chinese speech.

8. A model-complementary Chinese prosodic break recognition system, the system comprising:

9. The system of claim 8, characterized in that the prosodic break recognition module specifically comprises:

a first input module, for inputting the Chinese speech, the Chinese speech text, and the time segmentation boundary information of each Chinese character in the Chinese speech;

a word segmentation and part-of-speech tagging module, for performing word segmentation and part-of-speech tagging on the input Chinese speech text;

a first lexical and grammatical feature calculation module, for calculating the lexical and grammatical features of each Chinese character in the Chinese speech text from the results of the word segmentation and part-of-speech tagging;

a fundamental frequency extraction and intensity calculation module, for performing fundamental frequency extraction and intensity calculation on the input Chinese speech;

a first acoustic feature calculation module, for obtaining the acoustic features of each Chinese character in the Chinese text using the time segmentation boundary information of each Chinese character in the Chinese speech;

a combined complementary model loading module, for loading the trained combined complementary model;

a recognition module, for recognizing the prosodic break type of each Chinese character with the trained combined complementary model, using the acoustic, lexical, and grammatical features of each Chinese character;

a prosodic break annotation result storage module, for storing the prosodic break type of each Chinese character.

10. The system of claim 8, characterized in that the combined complementary model training module comprises:

a second input module, for reading in the Chinese speech and the corresponding prosodic annotation file from a corpus with prosodic annotation, the prosodic annotation file comprising the Chinese speech text, the time segmentation boundary information of each Chinese character in the Chinese speech, the stress type of each Chinese character, and the prosodic break type of each Chinese character;

a second lexical and grammatical feature calculation module, for calculating the lexical and grammatical features of each Chinese character in the Chinese speech text of the prosodic annotation file;

a second acoustic feature calculation module, for calculating the acoustic features of each Chinese character in the Chinese speech text of the prosodic annotation file;

a conditional random field model training module, for training a conditional random field model on the acoustic, lexical, and grammatical features of each input Chinese character using the conditional random field method;

a Boosting CART model training module, for training a Boosting CART model on the acoustic, lexical, and grammatical features of each input Chinese character using the Boosting CART learning method;

a weighted combination module, for combining the conditional random field model and the Boosting CART model by weighting to obtain the final combined complementary model for Chinese prosodic break recognition.