CN101777347A

CN101777347A - Model complementary Chinese accent identification method and system

Info

Publication number: CN101777347A
Application number: CN200910250394A
Authority: CN
Inventors: 刘文举; 倪崇嘉
Original assignee: Institute of Automation of Chinese Academy of Science
Current assignee: Institute of Automation of Chinese Academy of Science
Priority date: 2009-12-07
Filing date: 2009-12-07
Publication date: 2010-07-14
Anticipated expiration: 2029-12-07
Also published as: CN101777347B

Abstract

The invention relates to a model-complementary Chinese accent identification method and system. In the method, a first input module receives Chinese speech, the corresponding Chinese text, and the segmentation boundary of each Chinese character in the speech. A word segmentation and part-of-speech tagging module performs word segmentation and part-of-speech tagging on the input text to obtain the dictionary feature and grammatical feature of each Chinese character in the text. A first acoustic feature calculation module uses a fundamental frequency extraction, band-pass energy calculation and sound intensity calculation module to perform fundamental frequency extraction, band-pass energy calculation and sound intensity calculation on the input speech, yielding the acoustic feature of each Chinese character in the text. The trained complementary model is then loaded, the acoustic, dictionary and grammatical features of each input Chinese character are used to identify its accent type, and the Chinese text labeled with accent types is output.

Description

A model-complementary Chinese accent identification method and system

Technical field

The present invention relates to speech and language information technology, and in particular to technology for identifying the stress type of Chinese speech.

Technical background

When people communicate, they convey not only the literal content of spoken and written language; the prosodic information carried by speech is also an important part of the message. Prosodic features are also called suprasegmental features. On the one hand, well-organized prosody lets the speaker express the intended information clearly; on the other hand, correctly understanding prosodic information is an important help to the listener in accurately understanding what is heard. Research over recent decades has also shown that introducing prosodic features plays an important role in lowering the error rate of speech recognition, reducing problem complexity, increasing the accuracy of system understanding, and improving the naturalness of speech synthesis. Trainable statistical prosody models have been applied successfully to speech synthesis, where they greatly improve naturalness. In speech recognition, prosody models have been applied successfully to German and French. For Chinese, prosody models are gradually beginning to be applied to speech recognition, but the results are not yet good; in natural language recognition in particular, little prosodic information is used. Research on statistical prosody models based on large-scale corpora for speech recognition and speech understanding therefore needs to go further.

Human language is characterized not only by the hierarchical structure of its units; the relative prominence of each unit is equally important. Stress means that some part of a polysyllabic word, phrase or sentence is pronounced more strongly and prominently than the other parts, and therefore sounds louder and clearer. Accurate stress identification can reduce the error rate of speech recognition, improve the naturalness of speech synthesis, capture the focus information in speech, and increase the intelligibility of speech. Because stress plays an increasingly important role in speech engineering, automatic stress identification by computer, through model building, is receiving more and more attention. Fig. 1 illustrates the hierarchical structure of Chinese prosody together with an example of stress annotation. In this example, "|" marks a prosodic break: the more "|" symbols, the higher the break level and the longer the pause. Underlined Chinese characters are stressed. As the example shows, the prosodic hierarchy and stress play a very important part in the cadence of speech, its intelligibility, and the capture of focus information.

In the prior art, research on stress identification falls into two classes. The first class models the acoustic features and the dictionary and grammatical features separately, and then obtains a better model by weighted combination. The second class models all features directly, for example by combining weak classifiers with an ensemble method or by applying some single modeling method, to obtain a model for judging the stress type. The weakness of the first class is that, although the final weighting captures some relation between the acoustic features and the dictionary and grammatical features, the deeper connections between them are not exploited by the model; at the model level, these methods use only the features provided by the current syllable. The weakness of the second class is that, although training on acoustic, dictionary and grammatical features together strengthens the connection between them, no particular group of features, such as the acoustic or the grammatical ones, is emphasized, and contextual features are not well exploited at the model level.

The present method largely overcomes these weaknesses by making full use of acoustic, dictionary and grammatical information. A boosted classification and regression tree (Boosting CART) method models all features, while conditional random fields (CRFs) model the dictionary and grammatical features; the two models are then combined by weighting. Dictionary and grammatical information is thus used twice, but it is precisely this reuse that emphasizes the dictionary and grammatical information and, at a deeper level of the model, reflects the relation between the acoustic features and the dictionary and grammatical features. The Boosting CART model reflects the attributes of the current syllable and the deeper connections among those attributes, while the CRFs model reflects the contextual properties of the syllable. This complementarity at the model level gives the weighted complementary model better recognition performance.

Summary of the invention

The object of the present invention is to provide a model-complementary Chinese accent identification method and system that overcome the above deficiencies in stress identification and improve its accuracy.

To achieve the above object, a first aspect of the present invention provides a model-complementary Chinese accent identification method comprising two parts: a training step A for the complementary model, and a step B in which the complementary model is used to identify Chinese accent.

Step A: training of the complementary model. A complementary model training module trains a boosted classification and regression tree model on the acoustic, dictionary and grammatical features of Chinese characters using the Boosting CART method, and at the same time trains a conditional random field model on the dictionary and grammatical features using the Conditional Random Fields (CRFs) method. Finally, the trained Boosting CART model and conditional random field model are combined by weighting to obtain the complementary model;

Step B: identification of Chinese accent using the complementary model, comprising the following steps:

Step B1: a first input module receives the Chinese speech, the Chinese text, and the segmentation boundary of each Chinese character in the Chinese speech;

Step B2: a word segmentation and part-of-speech tagging module performs word segmentation and part-of-speech tagging on the input Chinese text, and a first dictionary feature and grammatical feature calculation module uses the segmentation and tagging results to calculate the dictionary feature and grammatical feature of each Chinese character in the text;

Step B3: a first acoustic feature calculation module uses a fundamental frequency extraction, band-pass energy calculation and sound intensity calculation module to perform fundamental frequency extraction, calculation of the energy in the 500 Hz to 2000 Hz band, and sound intensity calculation on the input Chinese speech, and combines the results with the segmentation boundary information of each Chinese character to obtain the acoustic feature of each Chinese character in the text;

Step B4: a complementary model loading module and an identification module load the complementary model trained by the complementary model training module, and use the features of each Chinese character, comprising its acoustic, dictionary and grammatical features, to identify the stress type of each Chinese character with the loaded model;

Step B5: a stress annotation result storage module stores the stress type of each Chinese character.

In a preferred embodiment, word segmentation and part-of-speech tagging of the input Chinese text yield the tone, word segmentation and part-of-speech information of the Chinese characters in the text, and the dictionary feature and grammatical feature of each Chinese character are calculated from this tone, word segmentation and part-of-speech information.
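As an illustration of how per-character dictionary and grammatical features might be derived from word segmentation, part-of-speech and tone information, the following is a minimal Python sketch. The function names `char_features` and `add_context`, the feature keys, the `single` label for one-character words and the `PAD` filler are all assumptions made for illustration. The stress-probability feature mentioned in the detailed description is omitted, and the context window of two preceding characters and one following character is taken from the detailed embodiment.

```python
from typing import Dict, List, Tuple


def char_features(words: List[Tuple[str, str]], tones: List[int]) -> List[Dict[str, object]]:
    """Build one feature dict per character.

    words: (word, part-of-speech tag) pairs produced by a segmenter/tagger.
    tones: one tone value per character, in text order.
    """
    feats = []
    idx = 0
    for word, pos in words:
        for i, ch in enumerate(word):
            if len(word) == 1:
                position = "single"      # assumption: not named in the source
            elif i == 0:
                position = "initial"
            elif i == len(word) - 1:
                position = "final"
            else:
                position = "medial"
            feats.append({
                "char": ch,
                "tone": tones[idx],
                "pos": pos,
                "position": position,
                "word_boundary": i == len(word) - 1,
            })
            idx += 1
    return feats


def add_context(feats: List[Dict[str, object]], left: int = 2, right: int = 1) -> List[Dict[str, object]]:
    """Append the features of neighbouring characters inside the context window."""
    out = []
    for i, f in enumerate(feats):
        row = dict(f)
        for off in list(range(-left, 0)) + list(range(1, right + 1)):
            j = i + off
            neighbour = feats[j] if 0 <= j < len(feats) else None
            for key in ("tone", "pos", "position"):
                row[f"{key}[{off:+d}]"] = neighbour[key] if neighbour else "PAD"
        out.append(row)
    return out
```

Rows of this shape could be passed to any sequence labeller or classifier; how the actual system encodes them is not stated in the source.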

In a preferred embodiment, the input Chinese speech is sampled at 16 kHz and quantized to 16 bits. Mel-frequency cepstral coefficient (MFCC) features are calculated for each frame using a Hamming window with a window length of 25.6 ms and a frame shift of 10 ms. The fundamental frequency, the energy in the 500 Hz to 2000 Hz band, and the sound intensity of each frame are then calculated, and the acoustic feature of each Chinese character is calculated from the segmentation boundary information of each Chinese character in the input speech.
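The framing parameters above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the band energy is obtained by summing squared DFT magnitudes over the bins between 500 Hz and 2000 Hz (a rectangular band, not the Gaussian filter mentioned in the detailed embodiment), the 25.6 ms window is truncated to 409 samples, and the names `hamming`, `frames` and `band_energy` are hypothetical.

```python
import cmath
import math
from typing import List

SAMPLE_RATE = 16000   # 16 kHz sampling
WIN = 409             # 25.6 ms at 16 kHz is 409.6 samples, truncated here
SHIFT = 160           # 10 ms frame shift at 16 kHz


def hamming(n: int) -> List[float]:
    """Standard Hamming window coefficients."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def frames(signal: List[float], win: int = WIN, shift: int = SHIFT) -> List[List[float]]:
    """Cut the signal into overlapping Hamming-windowed frames."""
    w = hamming(win)
    return [[signal[s + i] * w[i] for i in range(win)]
            for s in range(0, len(signal) - win + 1, shift)]


def band_energy(frame: List[float], lo: float = 500.0, hi: float = 2000.0,
                rate: int = SAMPLE_RATE) -> float:
    """Sum of squared DFT magnitudes for bins whose frequency lies in [lo, hi]."""
    n = len(frame)
    k_lo = math.ceil(lo * n / rate)
    k_hi = math.floor(hi * n / rate)
    energy = 0.0
    for k in range(k_lo, k_hi + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(frame))
        energy += abs(coeff) ** 2
    return energy
```

A direct DFT is used only to keep the sketch dependency-free; a real front end would use an FFT.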

In a preferred embodiment, during weighted combination, the weights of the Boosting CART model and the conditional random field model are adjusted according to the recognition rate of the complementary model on a development set.
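The text does not give the combination formula, so the sketch below assumes a linear interpolation of the per-class scores of the two models, with a single weight `w` chosen by grid search to maximize accuracy on the development set. The function names and the grid step are illustrative only.

```python
from typing import List, Sequence, Tuple


def combine(p_cart: Sequence[float], p_crf: Sequence[float], w: float) -> List[float]:
    """Assumed combination rule: w * CART score + (1 - w) * CRF score per class."""
    return [w * a + (1.0 - w) * b for a, b in zip(p_cart, p_crf)]


def predict(p_cart: Sequence[float], p_crf: Sequence[float], w: float) -> int:
    """Index of the class with the highest combined score."""
    scores = combine(p_cart, p_crf, w)
    return max(range(len(scores)), key=scores.__getitem__)


def tune_weight(dev_cart: List[Sequence[float]], dev_crf: List[Sequence[float]],
                dev_labels: List[int], steps: int = 20) -> Tuple[float, float]:
    """Grid-search w in [0, 1]; return the first weight with the best dev accuracy."""
    best_w, best_acc = 0.0, -1.0
    for i in range(steps + 1):
        w = i / steps
        correct = sum(predict(a, b, w) == y
                      for a, b, y in zip(dev_cart, dev_crf, dev_labels))
        acc = correct / len(dev_labels)
        if acc > best_acc:
            best_w, best_acc = w, acc
    return best_w, best_acc
```

With `w = 1` only the Boosting CART scores are used and with `w = 0` only the CRF scores, so the search covers both single models as special cases.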

To achieve the above object, a second aspect of the present invention provides a system for Chinese accent identification based on model complementarity, the system comprising:

A first input module, which receives the input Chinese speech, the Chinese text, and the time segmentation information of each Chinese character in the Chinese speech;

A word segmentation and part-of-speech tagging module, connected to the first input module, which performs word segmentation and part-of-speech tagging on the input Chinese text to obtain its word segmentation and part-of-speech tag sequence;

A fundamental frequency extraction, band-pass energy calculation and sound intensity calculation module, connected to the first input module, which processes the input Chinese speech by fundamental frequency extraction, band-pass energy calculation over 500 Hz to 2000 Hz, and sound intensity calculation;

A first dictionary feature and grammatical feature calculation module, connected to the word segmentation and part-of-speech tagging module, which uses the segmentation and tagging results to calculate the dictionary and grammatical features of each Chinese character in the Chinese text;

A first acoustic feature calculation module, connected to the fundamental frequency extraction, band-pass energy calculation and sound intensity calculation module, which uses the fundamental frequency, band-pass energy and sound intensity results together with the segmentation boundary information of the input Chinese characters to calculate the acoustic feature of each Chinese character in the Chinese text;

A complementary model loading module and an identification module, connected to the first acoustic feature calculation module and the first dictionary feature and grammatical feature calculation module; the complementary model loading module loads the complementary model, and the complementary model uses the calculated acoustic, dictionary and grammatical features of each Chinese character to identify the stress type of that character;

A stress annotation result storage module, connected to the complementary model loading module and the identification module, which stores the stress type annotation results for the Chinese text;

A complementary model training module, which trains the complementary model used for Chinese accent identification.

In a preferred embodiment, the complementary model training module comprises:

A second input module, which reads prosodically annotated Chinese speech and the corresponding prosodic annotation files from a corpus;

A second acoustic feature calculation module, connected to the second input module, which performs fundamental frequency extraction, band-pass energy calculation over 500 Hz to 2000 Hz, and sound intensity calculation on the input Chinese speech, and calculates the acoustic feature of each Chinese character from the segmentation boundary information of the Chinese speech in the prosodic annotation;

A second dictionary feature and grammatical feature calculation module, connected to the second input module, which performs word segmentation and part-of-speech tagging on the Chinese text in the prosodic annotation and combines the results with the prosodic annotation information to calculate the dictionary feature and grammatical feature of each Chinese character;

A Boosting CART model training module, connected to the second acoustic feature calculation module and the second dictionary feature and grammatical feature calculation module, which trains a boosted classification and regression tree model on the acoustic, dictionary and grammatical features of the input Chinese characters using the Boosting CART learning method;

A conditional random field model training module, connected to the second dictionary feature and grammatical feature calculation module, which trains a conditional random field model using the conditional random field method;

A weighted combination module, connected to the Boosting CART model training module and the conditional random field model training module, which obtains the final complementary model for Chinese accent identification by weighted combination; during weighted combination, the weights of the Boosting CART model and the CRFs model are adjusted according to the recognition rate of the complementary model on a development set;

A complementary model storage module, connected to the weighted combination module, which stores the trained complementary model.

The beneficial effect of the present invention is as follows. All features, acoustic, dictionary and grammatical, are modeled with the boosted classification and regression tree (Boosting CART) method, while the dictionary and grammatical features are additionally modeled with conditional random fields (CRFs); the Boosting CART model and the CRFs model obtained in this way are then combined by weighting into a complementary model with a higher recognition rate. Although both models reuse dictionary and grammatical information, it is precisely this reuse that emphasizes the dictionary and grammatical information and the contextual information at the model level. Moreover, the Boosting CART model reflects all attributes of the current syllable and the connections among them, while the CRFs model reflects the contextual properties of the syllable, so that the complementary model generated by weighted combination can take full advantage of this complementarity. The present invention overcomes the deficiencies of earlier accent identification methods and improves the recognition rate of Chinese accent.

Description of drawings

Fig. 1 is a schematic diagram of the word segmentation and part-of-speech tagging of a text, its prosodic hierarchy, and its stress annotation;

Fig. 2 is a structural block diagram of the system of the present invention;

Fig. 3 is a flow block diagram of an embodiment of the system of the present invention;

Fig. 4 is a flow block diagram of Chinese accent identification in the system of the present invention;

Fig. 5 is a flow block diagram of model training in the system of the present invention.

Detailed description of the embodiments

Each detail of the technical solution of the present invention is described below with reference to the accompanying drawings. It should be noted that the described embodiments are intended only to facilitate understanding of the present invention and do not limit it in any way.

As shown in Fig. 2, the present invention is a model-complementary Chinese accent identification system comprising: a first input module 101, which receives the input Chinese speech, the Chinese text, and the time segmentation information of each Chinese character in the Chinese speech; a word segmentation and part-of-speech tagging module 102, which performs word segmentation and part-of-speech tagging on the input Chinese text to obtain its word segmentation and part-of-speech tag sequence; a fundamental frequency extraction, band-pass energy calculation and sound intensity calculation module 103, which performs fundamental frequency extraction, band-pass energy calculation and sound intensity calculation on the input Chinese speech; a first dictionary feature and grammatical feature calculation module 104, which uses the segmentation and tagging results to calculate the dictionary and grammatical features of each Chinese character in the Chinese text; a first acoustic feature calculation module 105, which combines the results of module 103 with the segmentation information of the input Chinese characters to calculate the acoustic feature of each Chinese character in the Chinese text; a model loading module 106 and an identification module 107, which apply the stress model to the calculated acoustic, dictionary and grammatical features of each Chinese character to identify its stress type; a storage module 108, which stores the stress type annotation results for the Chinese text; and a complementary model training module 109. The complementary model training module 109 uses ensemble learning and conditional random field modeling techniques from machine learning, in a supervised setting, to train a Boosting CART model and a conditional random field model on the calculated features of the Chinese characters, comprising acoustic, dictionary and grammatical features. The two trained models are then combined by weighting to obtain the final complementary model for Chinese accent identification. During weighted combination, the weights of the Boosting CART model and the conditional random field model are adjusted according to the recognition rate of the complementary model on a development set.

The system can be implemented on a computer, a server or a computer network; its first input module can be a device such as a keyboard, a mouse, a microphone or a communication interface.

Embodiment

As shown in Fig. 3, the system mainly comprises a Chinese accent identification part and a complementary model training part. The Chinese accent identification part consists mainly of the first input module 101, the word segmentation and part-of-speech tagging module 102, the fundamental frequency extraction, band-pass energy calculation and sound intensity calculation module 103, the dictionary feature and grammatical feature calculation module 104, the first acoustic feature calculation module 105, the model loading module 106, the identification module 107 and the stress annotation result storage module 108.

The first input module 101 receives Chinese speech in wav format, the Chinese text, and the time segmentation information of each Chinese character in the Chinese speech. The time segmentation boundary of each Chinese character is obtained by forced alignment with a speech recognition system. The time segmentation information of each Chinese character in a wav file is therefore input in the MLF file format of the HTK toolkit. If there is more than one input wav file, a list of wav files is input instead. In the corresponding Chinese text, the text of each wav file occupies one line.

The word segmentation and part-of-speech tagging module 102 performs word segmentation and part-of-speech tagging on the input text. This module is the basis of text analysis because, unlike English and similar languages, Chinese text has no spaces separating words. The input text must therefore first be segmented and tagged, and the result is passed to the first dictionary feature and grammatical feature calculation module 104 as the basis for subsequent processing. For word segmentation and part-of-speech tagging we use the existing tool of our laboratory. Word segmentation first divides the sentence into indivisible atoms (each atom can be a Chinese character, a punctuation mark, a digit string or a letter string) and then performs coarse segmentation with an N-best method: the whole sentence is built into a directed acyclic graph according to the dictionary, in which each edge indicates that the atoms it spans form a word in the dictionary; the N shortest paths from the start node to the end node are then found by dynamic programming, giving the coarse segmentation results of the sentence. The coarse segmentation results are fed to a hidden Markov model (HMM), which produces the final word segmentation and part-of-speech tagging results at the same time. Named entity recognition uses a statistical method: the probabilities of each character appearing at the beginning of a word, in the middle of a word, at the end of a word, and as a word on its own are recorded from the corpus, a threshold is set, and an empirical formula then decides whether a single character should be merged with the preceding word.

The fundamental frequency extraction, band-pass energy calculation and sound intensity calculation module 103 samples the input Chinese speech at 16 kHz, quantizes it to 16 bits, and calculates the Mel-frequency cepstral coefficient (MFCC) features of each frame using a Hamming window with a window length of 25.6 ms and a frame shift of 10 ms. It also calculates the fundamental frequency, the energy in the 500 Hz to 2000 Hz band, and the sound intensity of each frame; the results are the basis for the subsequent first acoustic feature calculation module 105. We extract the fundamental frequency with the robust pitch tracking algorithm RAPT and, to make the fundamental frequency contour continuous, interpolate it by piecewise cubic Hermite interpolation. To remove the influence of different speakers, we normalize the fundamental frequency with the Z-score algorithm. When calculating the MFCC features, we use a bank of 24 triangular filters, apply the discrete cosine transform (DCT) to the vector formed by the filter outputs, and keep the first 12 coefficients, which together with the energy of the fundamental frequency give 13 dimensions in total. After the FFT, the input speech is used both for the MFCC calculation above and for the band-pass energy calculation. We calculate the band energy in the frequency domain with a Gaussian filter covering 500 Hz to 2000 Hz.

The dictionary feature and grammatical feature calculation module 104 calculates the dictionary feature and grammatical feature of each Chinese character in the Chinese text from the text analysis results of the word segmentation and part-of-speech tagging module 102. These features are calculated from two aspects. The first is the dictionary and grammatical features of the character itself, comprising its tone, its part of speech, its position in the word (word-initial, word-medial or word-final), whether it lies on a word segmentation boundary, and the probability that it is stressed. The second is the contextual properties of the character, comprising the dictionary and grammatical features of the characters to its left and right. In view of the characteristics of Chinese, we mainly take the two characters preceding the current character and the one character following it as the context window.

The first acoustic feature calculation module 105 calculates the acoustic feature of each Chinese character from the results of the fundamental frequency extraction, band-pass energy calculation and sound intensity calculation module 103, together with the time segmentation boundary information of the Chinese characters in the speech received by the first input module 101. The acoustic features of each Chinese character mainly comprise statistical features of the fundamental frequency, band-pass energy and sound intensity of the current character, and statistical features of the same quantities for the current character within its context window. The former are static statistical features, comprising the maximum, minimum, mean, standard deviation, root mean square and range. The latter are dynamic statistical features, obtained by normalizing the static statistical features of the current character by the maximum and by the range within its context window, respectively; they characterize the acoustic change of the character in context.

The complementary model loading module 106 loads the complementary model, which has been trained in advance. The identification module 107 uses the complementary model loaded by the complementary model loading module and the features of the input Chinese characters, comprising acoustic, dictionary and grammatical features, to identify the stress type of each Chinese character; the result is passed to the stress annotation result storage module 108. The stress annotation result storage module 108 writes the stress annotation results for the Chinese text to a storage medium.
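The static and dynamic statistical features described for the first acoustic feature calculation module 105 could be computed roughly as in the following Python sketch. It rests on several assumptions that are not stated in the source: character boundaries are given as half-open frame index ranges, "normalizing" is interpreted as dividing by the maximum and by the range found within the context window, the window covers two preceding characters and one following character, and the names `static_stats`, `char_stats` and `dynamic_stats` are hypothetical.

```python
import math
from typing import Dict, List, Tuple


def static_stats(values: List[float]) -> Dict[str, float]:
    """Maximum, minimum, mean, standard deviation, RMS and range of one track segment."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return {
        "max": max(values),
        "min": min(values),
        "mean": mean,
        "std": math.sqrt(var),
        "rms": math.sqrt(sum(v * v for v in values) / n),
        "range": max(values) - min(values),
    }


def char_stats(track: List[float], boundaries: List[Tuple[int, int]]) -> List[Dict[str, float]]:
    """Static statistics per character; boundaries are (start, end) frame indices."""
    return [static_stats(track[s:e]) for s, e in boundaries]


def dynamic_stats(stats: List[Dict[str, float]], i: int,
                  left: int = 2, right: int = 1) -> Dict[str, float]:
    """Static statistics of character i divided by the context-window maximum and range."""
    window = stats[max(0, i - left): i + right + 1]
    ctx_max = max(s["max"] for s in window)
    ctx_range = ctx_max - min(s["min"] for s in window)
    out = {}
    for key, value in stats[i].items():
        out[key + "/ctx_max"] = value / ctx_max if ctx_max else 0.0
        out[key + "/ctx_range"] = value / ctx_range if ctx_range else 0.0
    return out
```

The same functions would be applied separately to the fundamental frequency, band-pass energy and sound intensity tracks of each utterance.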

The complementary model training module 109 consists of a second input module 201, a second dictionary feature and grammatical feature calculation module 202, a second acoustic feature calculation module 203, a conditional random field model training module 204, an ensemble classification and regression tree model training module 205, a weighted combination module 206 and a complementary model storage module 207. The complementary model training module is the core of the present invention: it is what distinguishes the invention from other accent identification methods and systems, and it embodies the principle of model complementarity.

The second input module 201 reads Chinese speech and the corresponding prosodic annotation files from a prosodically annotated corpus. Each prosodic annotation file records the text of the Chinese speech, the time segmentation of each Chinese character in the speech, the accent type of each Chinese character and the prosodic break type of each Chinese character. The prosodically annotated corpus we use contains the speech of 10 speakers (5 male, 5 female) and 18 texts, comprising 87586 Chinese characters (with repetitions).

The second dictionary feature and grammatical feature calculation module 202 calculates the dictionary features and grammatical features of each Chinese character in the text. We perform word segmentation and part-of-speech tagging on the Chinese text in the prosodic annotation to obtain its word segmentation and part-of-speech information and, combining these with the tone information in the prosodic annotation, calculate the dictionary features and grammatical features of each Chinese character. These include the tone of the character, its part of speech, its position within the word (word-initial, word-medial or word-final), whether it lies on a word boundary, and the probability that it is stressed. The context of the character is also taken into account, namely the dictionary and grammatical features of the characters to its left and right. Given the characteristics of Chinese, we mainly use the two characters preceding the current character and the one character following it as the context window.

The second acoustic feature calculation module 203 calculates the acoustic features of each Chinese character in the text. It performs fundamental frequency extraction, calculation of the band-pass energy from 500 Hz to 2000 Hz, and sound intensity calculation on the input Chinese speech. The Chinese speech is sampled at 16 kHz and quantized with 16 bits, and a Hamming window with a window length of 25.6 ms and a frame shift of 10 ms is used to calculate the Mel cepstrum (MFCC) features of each frame. The fundamental frequency, the energy in the 500 Hz to 2000 Hz band and the sound intensity of each frame of the Chinese speech are calculated, and the results serve as the basis for subsequent processing by the first acoustic feature calculation module 105. We extract the fundamental frequency with the robust pitch tracking algorithm RAPT and, to make the fundamental frequency contour continuous, interpolate it with piecewise cubic Hermite interpolation. To remove the influence of different speakers, we normalize the fundamental frequency with the Z-score algorithm. When calculating the MFCC features, we use a bank of 24 triangular filters, apply the discrete cosine transform (DCT) to the vector formed by the filter outputs and keep the first 12 coefficients, to which the energy of the fundamental frequency is added, giving 13 dimensions in total. After the FFT, the input speech is used both for the MFCC calculation above and for the band-pass energy calculation. We compute the energy in the frequency domain with a Gaussian filter covering 500 Hz to 2000 Hz.

The conditional random field model training module 204 is connected to the second dictionary feature and grammatical feature calculation module 202 and trains the conditional random field model with the conditional random field method. Conditional random field (CRFs) models are well suited to labeling sequential data, provide a powerful means of describing contextual information, and have good trainability.
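Before turning to the model itself, two of the feature-preparation steps described above can be illustrated with a minimal sketch: per-speaker Z-score normalization of the fundamental frequency, and the context window of two preceding characters and one following character. The function names, the dictionary layout keyed by speaker, and the use of `None` as padding at sentence edges are assumptions made for illustration; the patent does not specify them.

```python
from statistics import mean, pstdev


def zscore_f0(f0_by_speaker):
    """Normalize each speaker's F0 values to zero mean and unit variance.

    f0_by_speaker maps a (hypothetical) speaker id to a list of F0 values
    taken from the already interpolated, continuous contour.
    """
    normalized = {}
    for speaker, f0 in f0_by_speaker.items():
        mu = mean(f0)
        sigma = pstdev(f0)
        if sigma == 0:
            # A flat contour carries no relative pitch information.
            normalized[speaker] = [0.0 for _ in f0]
        else:
            normalized[speaker] = [(value - mu) / sigma for value in f0]
    return normalized


def context_window(features, i, left=2, right=1, pad=None):
    """Return the features of characters i-left .. i+right, padded at the edges."""
    window = []
    for j in range(i - left, i + right + 1):
        window.append(features[j] if 0 <= j < len(features) else pad)
    return window
```

For a four-character sentence, `context_window(features, 0)` returns two padding entries followed by the features of the first and second characters, so every character gets a window of the same length.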

A conditional random field model can be expressed as follows. For sequence data X = x_1 ... x_T with corresponding label (state) sequence Y = y_1 ... y_T, let the parameters of a linear-chain conditional random field model be Λ = {λ_1, ..., λ_k}. The model then defines the conditional probability P_Λ(Y|X) as:

P_{\Lambda}(Y \mid X) = \frac{1}{Z_{X}} \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_{k} f_{k}(y_{t-1}, y_{t}, x_{t}) \right) \qquad (1)

where Z_X is the normalization factor for each input, which ensures that the probabilities of all state sequences sum to 1; f_k(y_{t-1}, y_t, x_t) is a feature function whose output may be any real number, although binary 0/1 outputs are generally used; λ_k is the weight of feature f_k, a parameter of the CRFs model that is learned during training; k is the feature index; and t = 1, 2, ..., T. T is the length of the sequence data X, y_{t-1} is the label of X at time t-1, and Y is the label sequence corresponding to X.
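As a concrete reading of equation (1), the following sketch evaluates P_Λ(Y|X) for a toy linear-chain model by enumerating every label sequence to obtain Z_X. The label set, the two feature functions, the observation symbols and the start symbol `"<s>"` standing in for y_0 are all hypothetical; enumeration is only feasible for very short sequences and is used here purely to make the definition explicit.

```python
from itertools import product
from math import exp

LABELS = ("S", "U")  # hypothetical label set: stressed / unstressed


# Hypothetical binary feature functions f_k(y_prev, y_cur, x_cur).
def f_high_pitch_stressed(y_prev, y_cur, x_cur):
    return 1.0 if y_cur == "S" and x_cur == "high" else 0.0


def f_two_stressed_in_a_row(y_prev, y_cur, x_cur):
    return 1.0 if y_prev == "S" and y_cur == "S" else 0.0


FEATURES = (f_high_pitch_stressed, f_two_stressed_in_a_row)


def sequence_score(ys, xs, weights, start="<s>"):
    """Sum over t and k of lambda_k * f_k(y_{t-1}, y_t, x_t)."""
    total = 0.0
    y_prev = start  # assumed placeholder for y_0
    for y_cur, x_cur in zip(ys, xs):
        for lam, f in zip(weights, FEATURES):
            total += lam * f(y_prev, y_cur, x_cur)
        y_prev = y_cur
    return total


def crf_probability(ys, xs, weights):
    """P(Y | X) as in equation (1), with Z_X computed by brute force."""
    z_x = sum(
        exp(sequence_score(candidate, xs, weights))
        for candidate in product(LABELS, repeat=len(xs))
    )
    return exp(sequence_score(ys, xs, weights)) / z_x
```

With all weights set to zero every label sequence is equally likely, so a three-character input with two labels gives each sequence probability 1/8; non-zero weights shift probability mass toward sequences on which the weighted features fire.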

A feature function can measure any aspect of the state transition y_{t-1} → y_t and of the whole observation sequence X centered at time t. Once the parameters Λ = {λ_1, ..., λ_k} of the conditional random field model are determined, the most probable label sequence y^* for the sequence data X is

y^{*} = \underset{Y}{\arg\max} \, P_{\Lambda}(Y \mid X) \qquad (2)

An improved Viterbi algorithm and the A^* algorithm are used to obtain the N best label sequences (N-best list).
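The search in equation (2) can be sketched with the standard Viterbi recursion. Note that this is the plain 1-best version, not the improved Viterbi plus A^* N-best search the description refers to. The `local_score` callback stands for the inner sum over k of λ_k f_k(y_{t-1}, y_t, x_t); the toy scoring function, label names and start symbol are hypothetical.

```python
def viterbi(xs, labels, local_score, start="<s>"):
    """Return the label sequence with the highest total score.

    local_score(y_prev, y_cur, x_cur) is assumed to return
    sum_k lambda_k * f_k(y_prev, y_cur, x_cur).
    """
    if not xs:
        return []
    # best[y] = (score of the best partial path ending in y, that path)
    best = {y: (local_score(start, y, xs[0]), [y]) for y in labels}
    for x in xs[1:]:
        step = {}
        for y in labels:
            # Choose the predecessor label that maximizes the extended score.
            prev = max(labels, key=lambda p: best[p][0] + local_score(p, y, x))
            step[y] = (best[prev][0] + local_score(prev, y, x), best[prev][1] + [y])
        best = step
    return max(best.values(), key=lambda entry: entry[0])[1]


def toy_local_score(y_prev, y_cur, x_cur):
    """Hypothetical weighted features: reward stress on high pitch, penalize S-S."""
    score = 0.0
    if y_cur == "S" and x_cur == "high":
        score += 2.0
    if y_prev == "S" and y_cur == "S":
        score -= 1.0
    return score
```

Because Z_X does not depend on Y, maximizing the unnormalized score is equivalent to maximizing P_Λ(Y|X), which is why the sketch never computes the normalization factor.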

The parameters of the model can be obtained by maximum likelihood estimation. The log-likelihood function of the training set {(x_i, y_i): i = 1, ..., M} can be written as

L_{\Lambda} = \sum_{i} \log P_{\Lambda}(y_{i} \mid x_{i}) = \sum_{i} \left( \sum_{t=1}^{T} \sum_{k} \lambda_{k} f_{k}(y_{t-1}, y_{t}, x_{t}) - \log Z_{x_{i}} \right) \qquad (3)

where L_Λ is the log-likelihood function of the training set, x_i is the i-th observation sequence in the training set, y_i is the label sequence of the i-th observation sequence, λ_k is the weight of feature f_k and is the parameter to be trained, Z_{x_i} is the normalization factor for input x_i, which ensures that the probabilities of all state sequences sum to 1, M is the size of the training set, and T is the length of the sequence data. Inside the inner sums, y_{t-1}, y_t and x_t refer to positions within the i-th training pair. We train the conditional random field model by gradient descent.
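For reference, the gradient that such a training procedure needs follows directly from equations (1) and (3). The derivation below is the standard result for linear-chain CRFs and is given as a sketch; the step size η and the primed symbols for candidate label sequences are notation introduced here, not taken from the patent.

```latex
\frac{\partial L_{\Lambda}}{\partial \lambda_{k}}
  = \sum_{i=1}^{M} \left(
      \sum_{t=1}^{T} f_{k}(y_{t-1}, y_{t}, x_{t})
      - \sum_{Y'} P_{\Lambda}(Y' \mid x_{i}) \sum_{t=1}^{T} f_{k}(y'_{t-1}, y'_{t}, x_{t})
    \right),
\qquad
\lambda_{k} \leftarrow \lambda_{k} + \eta \, \frac{\partial L_{\Lambda}}{\partial \lambda_{k}}
```

The first term counts how often feature f_k fires on the annotated labels and the second is its expected count under the current model, so the gradient vanishes when the two agree. Since L_Λ is maximized, the update moves along the gradient, which is the same as gradient descent on the negative log-likelihood.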

The ensemble classification and regression tree model training module 205 is connected to the second acoustic feature calculation module 203 and to the second dictionary feature and grammatical feature calculation module 202. It uses the ensemble classification and regression tree learning method to train an ensemble classification and regression tree model on the acoustic, dictionary and grammatical features of the input Chinese characters.

Boosting is an ensemble machine learning method, popular in recent years, for improving the accuracy of any given learning algorithm. The idea originates from the PAC (Probably Approximately Correct) learning model proposed by Valiant. Valiant and Kearns introduced the notions of weak learning and strong learning: a learning algorithm whose error rate is below 1/2, that is, whose accuracy is only slightly better than random guessing, is called a weak learning algorithm, while a learning algorithm that achieves high accuracy and runs in polynomial time is called a strong learning algorithm. Valiant and Kearns were also the first to pose, within the PAC learning model, the question of whether weak and strong learning algorithms are equivalent: given any weak learning algorithm that is only slightly better than random guessing, can it be boosted into a strong learning algorithm? If the two are equivalent, it suffices to find a weak learning algorithm slightly better than random guessing and boost it, without having to search for a strong learning algorithm, which is hard to obtain. In 1990 Schapire constructed the first polynomial-time algorithm that answers this question in the affirmative; this was the original Boosting algorithm. A year later Freund proposed a more efficient Boosting algorithm. Both algorithms share a practical drawback, however: they require the lower bound on the accuracy of the weak learning algorithm to be known in advance. In 1995 Freund and Schapire improved the Boosting algorithm and proposed AdaBoost, whose efficiency is almost the same as that of the algorithm Freund proposed in 1991 but which needs no prior knowledge about the weak learner and is therefore easier to apply to practical problems. Freund and Schapire later proposed AdaBoost.M1, AdaBoost.M2 and other algorithms that change the Boosting voting weights, which attracted great attention in the machine learning field, and many similar algorithms have been proposed since. AdaBoost is the most representative algorithm of the Boosting family, and the Boosting algorithms that appeared later were all developed on its basis.

We adopt the AdaBoost.M2 algorithm with classification and regression trees (CART) as the weak classifiers to train the accent recognition model. The ensemble classification and regression tree (Boosting CART) method not only reflects the attributes of the current syllable well but also captures the relations among attributes at a deeper level, and it has good trainability. We therefore apply the Boosting CART method to all features of each Chinese character, including the acoustic, dictionary and grammatical features, to train the accent model.

The weighted combination module is connected to the ensemble classification and regression tree model training module and to the conditional random field model training module. It obtains the final complementary model for Chinese accent identification by weighted combination; during the combination, the recognition rate of the complementary model on the development set is used to adjust the weights with which the Boosting CART model and the CRFs model are combined. We use equation (4) to combine the Boosting CART model and the CRFs model with weights and obtain the final complementary model.

Let W = {w_1, w_2, ..., w_n} be a syllable sequence, A = {a_1, a_2, ..., a_n} the corresponding acoustic feature sequence and S = {s_1, s_2, ..., s_n} the corresponding sequence of dictionary and grammatical features. The most probable accent label sequence P^* of W can then be expressed as:

P^{*} = \underset{P}{\arg\max} \, p(P \mid A, S)

\approx \underset{P}{\arg\max} \, p(P \mid A, S) \, p(P \mid S)

\approx \underset{P}{\arg\max} \prod_{i=1}^{n} p(p_{i} \mid a_{i}, s_{i}, \varphi(s_{i}))^{\lambda} \, p(p_{i} \mid s_{i}, \varphi(s_{i}))

\approx \underset{P}{\arg\max} \left[ \lambda \sum_{i=1}^{n} \log p(p_{i} \mid a_{i}, s_{i}, \varphi(s_{i})) + \sum_{i=1}^{n} \log p(p_{i} \mid s_{i}, \varphi(s_{i})) \right] \qquad (4)

where log p(p_i | a_i, s_i, φ(s_i)) is the score of the ensemble classification and regression tree model, log p(p_i | s_i, φ(s_i)) is the score of the conditional random field model, λ is the weight that balances the models built by the two methods, p denotes a probability distribution, P denotes an accent label sequence of the syllable sequence W, P^* denotes the most probable accent label sequence of W, A is the acoustic feature sequence, S is the sequence of dictionary and grammatical features, p_i is the accent label of the i-th syllable, a_i is the acoustic feature of the i-th syllable, s_i is the dictionary and grammatical feature of the i-th syllable, φ(s_i) is the dictionary and grammatical features within the context window of the i-th syllable, and i = 1, 2, ..., n.
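The last line of equation (4), together with the tuning of λ on a development set, can be sketched as follows. The sketch assumes that each model supplies, for every syllable, a strictly positive probability for each accent label (for the CRFs model this would be a per-position posterior); under that assumption the sum in equation (4) decomposes over syllables and the best label can be chosen position by position. The function names, the dictionary layout and the candidate grid for λ are hypothetical.

```python
from math import log


def combine(cart_probs, crf_probs, lam):
    """Per syllable, pick the label maximizing lam*log p_cart + log p_crf."""
    labels = []
    for p_cart, p_crf in zip(cart_probs, crf_probs):
        best = max(p_cart, key=lambda lab: lam * log(p_cart[lab]) + log(p_crf[lab]))
        labels.append(best)
    return labels


def tune_lambda(cart_probs, crf_probs, gold, candidates):
    """Return the candidate weight with the highest accuracy on a development set."""

    def accuracy(lam):
        predicted = combine(cart_probs, crf_probs, lam)
        return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

    return max(candidates, key=accuracy)
```

A larger λ lets the Boosting CART score, which sees the acoustic features, dominate; λ = 0 reduces the decision to the CRFs score alone. When several candidates tie, `max` keeps the first one in the list.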

Although Boosting CART modeling and CRFs modeling appear to use duplicated information (the dictionary and grammatical information is used twice), it is precisely this reuse that both emphasizes the dictionary and grammatical information during modeling and reflects, at a deeper level, the relation between the acoustic features and the dictionary and grammatical features. Because the Boosting CART model and the CRFs model are complementary, the complementary model obtained by weighted combination compensates for the weaknesses of each. It reflects the attributes of the current syllable and the relations among them at a deeper level while also capturing the context of the current syllable, and therefore achieves good recognition performance.

The complementary model storage module 207 is connected to the weighted combination module 206 and stores the trained complementary model.

Fig. 4 shows the flow diagram of the recognition part of the system of the present invention, which is a part of Fig. 3. It is given mainly to separate the recognition part of the system more clearly from the training part.

Fig. 5 shows the flow diagram of the model training part of the system of the present invention, which is a part of Fig. 3. We use the following parameter settings for training: when combining the weak CART classifiers with the Boosting method, we combine 100 CARTs and use 15-fold cross-validation to improve the accuracy of the Boosting CART model; when training with the CRFs method, we train the model by gradient descent.

The above is only an embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any variation or substitution that a person familiar with the art can conceive within the technical scope disclosed by the present invention shall fall within the scope of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of protection of the claims.

Claims

1. A model-complementary Chinese accent identification method, characterized in that it is implemented with a Chinese accent recognition system and comprises the following two parts: a training step A for the complementary model, and a step B of identifying Chinese accent with the complementary model;

Step B2: the word segmentation and part-of-speech tagging module performs word segmentation and part-of-speech tagging on the input Chinese text, and the first dictionary feature and grammatical feature calculation module calculates the dictionary features and grammatical features of each Chinese character in the Chinese text from the results of word segmentation and part-of-speech tagging;

2. The method according to claim 1, characterized in that word segmentation and part-of-speech tagging are performed on the input Chinese text to obtain the tone, word segmentation and part-of-speech information of the Chinese characters in the text, and the dictionary features and grammatical features of each Chinese character in the Chinese text are calculated from the tone, word segmentation and part-of-speech information obtained in text processing.

3. The method according to claim 1, characterized in that the input Chinese speech is sampled at 16 kHz and quantized with 16 bits; a Hamming window with a window length of 25.6 ms and a frame shift of 10 ms is used to calculate the Mel cepstrum (MFCC) features of each frame of the Chinese speech; the fundamental frequency, the energy in the 500 Hz to 2000 Hz band and the sound intensity of each frame of the Chinese speech are then calculated; and the acoustic features of each Chinese character are calculated according to the segmentation boundary information of each Chinese character in the input Chinese speech.

4. The method according to claim 1, characterized in that, in the weighted combination process, the recognition rate of the complementary model on the development set is used to adjust the weights with which the ensemble classification and regression tree model and the conditional random field model are combined.

5. A model-complementary Chinese accent recognition system, characterized in that the system comprises:

a first dictionary feature and grammatical feature calculation module, connected to the word segmentation and part-of-speech tagging module, which calculates the dictionary features and grammatical features of each Chinese character in the Chinese text from the results of word segmentation and part-of-speech tagging;

6. The system according to claim 5, characterized in that the complementary model training module comprises: