CN101246685A - Pronunciation quality evaluation method of computer auxiliary language learning system - Google Patents

Pronunciation quality evaluation method of computer auxiliary language learning system

Info

Publication number
CN101246685A
CN101246685A (application number CNA200810102076XA)
Authority
CN
China
Prior art keywords
score
phoneme
test speech
speech
reference speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA200810102076XA
Other languages
Chinese (zh)
Other versions
CN101246685B (en)
Inventor
刘加
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huacong Zhijia Technology Co Ltd
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN200810102076XA priority Critical patent/CN101246685B/en
Publication of CN101246685A publication Critical patent/CN101246685A/en
Application granted granted Critical
Publication of CN101246685B publication Critical patent/CN101246685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention belongs to the field of voice technology. The pronunciation quality evaluation method for a computer-assisted language learning system comprises: computing a matching score, computing a perceptual score based on the Mel frequency scale, computing a duration score and computing a pitch score, then mapping these scores and fusing the mapped scores. The method is robust and correlates highly with expert ratings, and can be used for interactive language learning and automatic spoken-language testing.

Description

Pronunciation quality evaluation method for a computer-assisted language learning system
Technical field
The invention belongs to the field of voice technology. Specifically, it relates to a method of using speech processing technology to evaluate pronunciation quality in a computer-assisted language learning (CALL) system.
Background technology
Reliable evaluation of a learner's pronunciation quality is the core function of a CALL system. Owing to the limitations of the prior art, however, the performance of existing pronunciation quality evaluation methods is far from ideal and still some distance from practical use.
At present, computer-based pronunciation quality evaluation is mainly based on hidden Markov models (HMM). Chinese invention patent application No. 200510114848.8 discloses a pronunciation quality evaluation method for a learning machine. That method trains standard pronunciation models with HMMs, searches for the optimal path, and computes a confidence score with which pronunciation quality is evaluated. It depends too heavily on HMM training and, among the many factors that influence pronunciation quality, evaluates only those related to the acoustic model. Its correlation with expert ratings is therefore insufficient: for words and short sentences, the correlation between machine scores and expert ratings is only 0.74.
Speech quality evaluation also arises in communication systems. ITU-T Recommendation P.862 discloses a speech quality assessment method for telephone channels. The reference speech is first passed through the telephone channel to obtain the test speech. Both signals are then mapped to the perceptual domain, the delay of the test speech relative to the reference is estimated accurately, and finally the perceptual difference of the test speech relative to the reference is computed in the perceptual domain and used to evaluate the quality of the test speech.
However, speech quality assessment in communication systems differs from pronunciation quality evaluation in a CALL system. First, in a voice communication system the factors affecting speech quality are typically channel noise, codec-induced damage to the speech, and network delay. The reference and test speech are the same sentence spoken by the same speaker, so, delay aside, the phonemes of the test speech generally show no duration change; nor does the assessment care whether the speaker's pronunciation is correct. In a CALL system, by contrast, the factors affecting pronunciation quality are more complex. The distortion of a learner's test speech is genuinely caused by mispronunciation and has little to do with noise. If the teacher's speech is taken as the reference and the learner's speech as the test speech, the pronunciation quality of the test speech must be assessed relative to the reference. Because the reference and test speech come from different speakers, the two utterances differ in length, and this difference is not caused by delay, so the signals cannot be aligned directly. Second, different speakers have different vocal tract lengths, so the formants of the same phoneme are not identical in the test and reference speech. In addition, the prosody of the two speakers differs, which manifests directly as different stress patterns in the test and reference speech; their pitch also differs, and so do their pitch contours.
A CALL system should imitate, as far as possible, the process by which an expert evaluates pronunciation quality. That process can usually be divided into three steps. First, the expert listens to the test speech through earphones or a loudspeaker. Next, the brain processes the perceived speech and, drawing on phonetic and linguistic knowledge, compares the reference and test speech to find mispronunciations and distortions at every level (for example, the acoustic and prosodic levels). Finally, the expert combines these distortions into an overall assessment of the test speech. Perception of speech quality is thus closely related to the result of pronunciation quality evaluation. Yet the prior art fails to evaluate pronunciation quality at the prosodic level and lacks research into the perceptual-distortion aspect of pronunciation quality.
Addressing these problems of the prior art, the present invention proposes a pronunciation quality evaluation method for a CALL system. Taking the teacher's speech as the reference, the method computes the quality differences of the learner's test speech relative to the reference in the acoustic, perceptual and prosodic domains, obtaining a matching score, a perceptual score, a duration score and a pitch score, and fuses the four scores into a final score for the test speech. For words and short sentences, the correlation coefficient between the scores produced by the invention and expert ratings reaches 0.800, outperforming prior-art methods.
Summary of the invention
The correlation between machine scores and expert ratings obtained by current HMM-based pronunciation quality evaluation methods is not high enough to satisfy the requirements of today's computer-assisted language learning systems. The object of the invention is to overcome these deficiencies of the prior art by proposing a pronunciation quality evaluation method for computer-assisted language learning systems. The invention computes a matching score, a perceptual score, a duration score and a pitch score from the teacher's reference speech and the student's test speech, covering the acoustic, perceptual and prosodic domains, and scores pronunciation quality by mapping and then fusing these scores; for words and short sentences, the correlation between the machine pronunciation quality scores and expert subjective quality ratings can reach 0.800.
The pronunciation quality evaluation method in a computer-assisted language learning system proposed by the invention mainly comprises: computing the matching score, computing the perceptual score based on the Mel frequency scale, computing the pitch score and computing the duration score; mapping the matching, perceptual, pitch and duration scores; and fusing the mapped scores. The method makes full use of multiple sources of pronunciation information to obtain a reliable fused score with which the student's pronunciation quality is evaluated (scored). Each part comprises the following steps:
1. The matching score, the perceptual score based on the Mel frequency scale, the pitch score and the duration score are computed as follows:
(A) The matching score is computed by: (1) using the Viterbi decoding algorithm to force-align the test speech and the reference speech separately, obtaining the phoneme boundaries and the likelihoods of both; (2) deriving the matching score from the difference between the likelihoods of the test speech and the reference (standard) speech.
(B) The perceptual score is computed by: (1) applying Mel filters to the test and reference speech; (2) mapping the Mel filter output energies of both to loudness according to the power law; (3) using the phoneme boundary information to further refine the frame-by-frame alignment of the reference and test speech at the phoneme level by dynamic time warping (DTW); (4) computing the perceptual score from the per-frame loudness differences between the reference and test speech.
(C) The duration score is computed by: (1) using the boundary information and a duration model to compute the log duration probabilities of the test speech and the reference (standard) speech; (2) deriving the duration score from the absolute difference between the duration probabilities of the two.
(D) The pitch score is computed by: (1) extracting the pitch of the reference and test speech separately; (2) using the boundary information to find the maximum and minimum pitch within each vowel of the reference and test speech and computing their difference for each vowel; (3) deriving the pitch score from the differences between these per-vowel pitch ranges in the reference and test speech.
2. Score mapping: the matching score, the perceptual score, the pitch score and the duration score are each mapped with one of a Sigmoid function, a polynomial function or a linear function, so that the mapped scores lie in the same interval as the expert ratings.
3. Score fusion: the mapped matching, perceptual, pitch and duration scores are fused with one of linear fusion, support vector machines (SVM), logistic regression, neural networks or Gaussian mixture models; through the complementarity of the multiple information sources, the fused score is brought closer to the expert rating.
The Viterbi decoding algorithm force-aligns the test speech and the reference (standard) speech against hidden Markov models (HMM) trained on a large speech database. The resulting alignment of the reference and test speech may consist of phoneme boundaries, state boundaries or word boundaries; the phoneme boundaries are the ones used here. The HMM parameters (including means and variances) are estimated according to the maximum likelihood criterion.
The duration model is a histogram model or a Gamma model of phoneme durations, trained on a large speech database.
The frame-by-frame refined alignment starts from the phoneme boundaries obtained during the matching score computation and further aligns the reference and test speech frame by frame at the phoneme level using dynamic time warping (DTW), so that the aligned speech is better comparable frame by frame.
When evaluating a learner's pronunciation quality, the pronunciation quality evaluation method proposed by the invention outperforms the prior art. It is robust and correlates highly with expert ratings, and can be used for pronunciation quality evaluation in interactive language learning and in automatic spoken-language testing systems.
The present invention has following advantage:
(1) The invention makes full use of the pronunciation differences between the teacher's reference speech and the student's test speech;
(2) The proposed perceptual score computation based on the Mel frequency scale has lower computational complexity than perceptual score computation based on critical bands, and performs better;
(3) The invention makes full use of multiple sources of evaluation information in the pronunciation (matching, perceptual, duration and pitch information) and fuses them, so that the different scores complement one another, improving both the robustness of the evaluation and its correlation with expert ratings;
(4) The pronunciation evaluation method can also be applied to learning multiple languages; it is robust, correlates highly with expert ratings, and can be implemented on current handheld PCs, personal digital assistants (PDAs) or learning machines, so its range of application is very wide.
Description of drawings
Fig. 1 is an overview of the pronunciation quality evaluation method;
Fig. 2 is a schematic diagram of the matching score computation;
Fig. 3 shows the HMM topology;
Fig. 4 is a schematic diagram of the perceptual score computation;
Fig. 5 is a schematic diagram of the duration model;
Fig. 6 is a schematic diagram of the pitch score computation;
Fig. 7 is a schematic diagram of machine score fusion.
Embodiment
An embodiment of the pronunciation quality evaluation method for computer-assisted language learning proposed by the invention is described in detail below with reference to the drawings. Fig. 1 is the overall flowchart of the method. (1) The reference speech and the test speech are first passed through the acoustic model, the perceptual model, the duration model and the pitch model to compute the matching score, the perceptual score, the duration score and the pitch score respectively. (2) These scores, which describe pronunciation quality in the acoustic, perceptual and prosodic domains respectively, are then fused. (3) The fused score is used to evaluate the pronunciation quality of the test speech.
The reference speech is the teacher's standard pronunciation and serves as the benchmark of the evaluation; the test speech is the learner's speech, the object being evaluated. The method therefore computes the pronunciation quality difference of the test speech relative to the reference speech. The computation in this embodiment proceeds as follows:
1. Matching score computation:
Fig. 2 is a schematic diagram of the matching score computation. The reference and test speech are first split into frames, yielding short-time stationary frames, and Mel-frequency cepstral coefficient (MFCC) features are extracted from every frame. Each frame yields 39 dimensions: 12 MFCC coefficients with their first- and second-order differences, plus the normalized energy with its first- and second-order differences. The MFCC features reflect the static characteristics of the speech, while the first- and second-order difference coefficients reflect its dynamics. A trained hidden Markov model (HMM) is then used with the Viterbi decoding algorithm to force-align the reference and test speech separately, yielding the likelihood scores and the phoneme boundaries of both. HMM training is well known to those skilled in the art and is only sketched here. The HMM uses a left-to-right state transition topology, which describes the pronunciation characteristics of speech well; for example, a 3-state HMM may be used, with the topology shown in Fig. 3, where q_i denotes an HMM state, a_ij a state transition probability, and b_j(O_t) the multi-stream Gaussian mixture output density of the HMM state, as in Eq. (1):
$$b_j(O_t)=\prod_{s=1}^{S}\left[\sum_{m=1}^{M_s}c_{jsm}\,N(O_{st};\mu_{jsm},\varphi_{jsm})\right]^{\gamma_s}\qquad(1)$$
where S is the number of data streams, M_s is the number of Gaussian mixture components in stream s, γ_s is the stream weight, and N is a multivariate Gaussian density, as in Eq. (2):
$$N(o;\mu,\varphi)=\frac{1}{\sqrt{(2\pi)^{n}|\varphi|}}\,e^{-\frac{1}{2}(o-\mu)^{T}\varphi^{-1}(o-\mu)}\qquad(2)$$
Both the test speech and the reference speech consist of a sequence of phonemes. After the reference and test speech have each been force-aligned, the matching score L(i) of the i-th phoneme is given by:
$$L(i)=\bigl|\log p_{\mathrm{test}}(O_{\mathrm{test}}\mid q_i)-\log p_{\mathrm{ref}}(O_{\mathrm{ref}}\mid q_i)\bigr|\qquad(3)$$
where p_test(O_test | q_i) is the likelihood of the test speech, p_ref(O_ref | q_i) is the likelihood of the reference speech, q_i denotes the HMM of the i-th phoneme, and O_test and O_ref are the MFCC feature vectors of the test and reference speech respectively.
The matching score of an utterance is defined as the average phoneme matching score:
$$S_{\mathrm{mat\_sen}}=\frac{1}{N_p}\sum_{i=1}^{N_p}L(i)\qquad(4)$$
where N_p is the total number of phonemes in the utterance and L(i) is the matching score of the i-th phoneme.
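For concreteness, the following is a minimal Python sketch of Eqs. (3)-(4), assuming the per-phoneme log-likelihoods have already been produced by HMM forced alignment; the array names and numeric values are hypothetical:

```python
import numpy as np

def matching_score(loglik_test, loglik_ref):
    """Eqs. (3)-(4): average absolute log-likelihood difference per phoneme.

    loglik_test, loglik_ref: per-phoneme log-likelihoods obtained from
    Viterbi forced alignment of the test and reference utterances
    against the same phoneme HMMs.
    """
    loglik_test = np.asarray(loglik_test, dtype=float)
    loglik_ref = np.asarray(loglik_ref, dtype=float)
    L = np.abs(loglik_test - loglik_ref)  # Eq. (3), one value per phoneme
    return L.mean()                       # Eq. (4), S_mat_sen

# Illustrative call for a 4-phoneme utterance:
print(matching_score([-52.1, -48.3, -60.0, -55.2],
                     [-50.0, -47.9, -58.4, -57.0]))
```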
2. Perceptual score computation:
Fig. 4 is a schematic diagram of the perceptual score computation. The reference and test speech are first split into frames and a Hanning window is applied. Each frame is then passed through triangular filters uniformly distributed on the Mel frequency scale, and the log energy M(q) of each filter output is obtained:
$$M(q)=\ln\!\left[\sum_{n=F_{q-1}}^{F_q}\frac{n-F_{q-1}}{F_q-F_{q-1}}G(n)+\sum_{n=F_q}^{F_{q+1}}\frac{F_{q+1}-n}{F_{q+1}-F_q}G(n)\right],\quad q=1,2,\dots,Q\qquad(5)$$
where F_q is the center frequency of the q-th triangular filter, F_{q+1} and F_{q-1} are its upper and lower cutoff frequencies, G(n) is the triangular window function, and Q is the number of triangular filters, typically Q = 20–26.
According to the power law of psychoacoustics, the log energy output by each triangular filter can be mapped into the loudness domain:
$$L(q)=0.048\,M(q)^{0.6}\qquad(6)$$
where M(q) is the log energy of the q-th filter output and L(q) is the corresponding loudness in the perceptual domain.
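A minimal sketch of Eqs. (5)-(6) for a single frame follows; converting the filter center frequencies to FFT-bin indices beforehand, and clipping negative log energies before applying the power law, are implementation assumptions rather than details given in the patent:

```python
import numpy as np

def mel_loudness(power_spectrum, centers):
    """Eqs. (5)-(6): triangular Mel filterbank log energies mapped to loudness.

    power_spectrum: one frame's power spectrum G(n)
    centers: FFT-bin indices F_0 .. F_{Q+1} of the filter centers,
             uniformly spaced on the Mel scale (assumed precomputed)
    """
    Q = len(centers) - 2
    n = np.arange(len(power_spectrum))
    L = np.zeros(Q)
    for q in range(1, Q + 1):
        lo, c, hi = centers[q - 1], centers[q], centers[q + 1]
        w = np.zeros(len(power_spectrum))
        up = (n >= lo) & (n <= c)
        down = (n > c) & (n <= hi)
        w[up] = (n[up] - lo) / (c - lo)        # rising edge of triangle
        w[down] = (hi - n[down]) / (hi - c)    # falling edge of triangle
        M = np.log(max(np.dot(w, power_spectrum), 1e-10))  # Eq. (5)
        L[q - 1] = 0.048 * max(M, 0.0) ** 0.6              # Eq. (6)
    return L
```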
Starting from the phoneme boundary alignment obtained during the matching score computation, dynamic time warping (DTW) is then used to refine the alignment of the reference and test speech frame by frame within each phoneme. The DTW method is well known to those skilled in the art.
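For concreteness, a minimal DTW sketch of this frame-level alignment follows; the Euclidean local distance and the standard three-step recursion are common choices, not specifics of the patent:

```python
import numpy as np

def dtw_align(a, b):
    """Align two feature sequences frame by frame.

    a: (Ta, Q) per-frame features of a phoneme in the test speech
    b: (Tb, Q) per-frame features of the same phoneme in the reference
    Returns the warping path as a list of (test_frame, ref_frame) pairs.
    """
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match
                                 cost[i - 1, j],      # skip test frame
                                 cost[i, j - 1])      # skip reference frame
    path, i, j = [], Ta, Tb                           # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```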
After every frame of the reference and test speech has been aligned by DTW, the loudness difference D(q) at each triangular filter output can be computed:
$$D(q)=L_{\mathrm{test}}(q)-L_{\mathrm{ref}}(q),\quad q=1,2,\dots,Q\qquad(7)$$
where L_test(q) and L_ref(q) are the loudness of the test and reference speech at the q-th triangular filter output.
Having obtained the loudness difference at each triangular filter output, the total loudness difference over the whole Mel band, i.e. the loudness difference of each frame, is computed next, as a weighted combination of the loudness differences of all filter outputs across the Mel band. The loudness difference p_frame(j) of the j-th frame of the reference and test speech is:
$$p_{\mathrm{frame}}(j)=\sqrt{\frac{\sum_{q=1}^{Q}\bigl(D(q)\,W(q)\bigr)^{2}}{\sum_{q=1}^{Q}W(q)}}\qquad(8)$$
where D(q) is the loudness difference between the reference and test speech in the q-th band and W(q) is the bandwidth of the q-th triangular filter.
The perceptual score of a phoneme is defined as the mean frame loudness difference between the reference and test speech:
$$p_{\mathrm{phone}}(i)=\sqrt[6]{\frac{1}{N}\sum_{j=1}^{N}\bigl[p_{\mathrm{frame}}(j)\bigr]^{6}}\qquad(9)$$
where N is the number of frames of the corresponding phoneme in the longer of the two utterances and p_frame(j) is the loudness difference of frame j. The perceptual score p_{p_sen} of a whole utterance is then the mean of the loudness differences of all its phonemes:
$$p_{p\_sen}=\frac{1}{N_p}\sum_{i=1}^{N_p}p_{\mathrm{phone}}(i)\qquad(10)$$
where N_p is the total number of phonemes in the utterance.
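Putting Eqs. (7)-(10) together, a sketch of the perceptual score for one phoneme and for the whole utterance; reading Eq. (8) as a bandwidth-weighted RMS over the Q bands is this sketch's interpretation, stated here as an assumption:

```python
import numpy as np

def phone_perceptual_score(L_test, L_ref, W):
    """Eqs. (7)-(9) for one phoneme.

    L_test, L_ref: (N, Q) per-frame loudness, already DTW-aligned
    W: (Q,) bandwidths of the Q triangular filters
    """
    D = L_test - L_ref                                  # Eq. (7)
    # Eq. (8): bandwidth-weighted frame loudness difference (assumed form)
    p_frame = np.sqrt(np.sum((D * W) ** 2, axis=1) / np.sum(W))
    N = len(p_frame)
    return (np.sum(p_frame ** 6) / N) ** (1.0 / 6.0)    # Eq. (9), L6 norm

def sentence_perceptual_score(phone_scores):
    """Eq. (10): mean of the per-phoneme perceptual scores."""
    return float(np.mean(phone_scores))
```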
3. Duration score computation:
Fig. 5 is a schematic diagram of the duration score computation. Based on the phoneme boundaries obtained during the matching score computation, a duration model is used to compute the duration probability score of every phoneme of the reference and test speech. The duration model may be a histogram model or a Gamma model; both are well known to those skilled in the art, so a detailed description is omitted.
The duration score d_phone of a phoneme is defined as the absolute difference of the log duration probability scores of the test and reference speech:
$$d_{\mathrm{phone}}=\bigl|\log D_{\mathrm{test}}-\log D_{\mathrm{ref}}\bigr|\qquad(11)$$
where D_test is the duration probability score of the phoneme in the test speech and D_ref is that of the corresponding phoneme in the reference speech.
The duration score d_sen of a whole utterance is defined as the mean of all phoneme duration scores:
$$d_{\mathrm{sen}}=\frac{1}{N_p}\sum_{i=1}^{N_p}d_{\mathrm{phone}}(i)\qquad(12)$$
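A sketch of Eqs. (11)-(12) under a Gamma duration model, assuming per-phoneme Gamma parameters trained in advance on a large corpus; all numeric values are illustrative:

```python
import numpy as np
from scipy.stats import gamma

def duration_score(dur_test, dur_ref, shape, scale):
    """Eqs. (11)-(12): utterance duration score from a Gamma duration model.

    dur_test, dur_ref: per-phoneme durations (frames) of the two utterances
    shape, scale: per-phoneme Gamma parameters (assumed pre-trained)
    """
    logp_test = gamma.logpdf(dur_test, a=shape, scale=scale)
    logp_ref = gamma.logpdf(dur_ref, a=shape, scale=scale)
    d_phone = np.abs(logp_test - logp_ref)   # Eq. (11)
    return d_phone.mean()                    # Eq. (12)

# Illustrative call for a 3-phoneme utterance:
print(duration_score(dur_test=np.array([8.0, 12.0, 20.0]),
                     dur_ref=np.array([10.0, 11.0, 25.0]),
                     shape=np.array([4.0, 5.0, 6.0]),
                     scale=np.array([2.5, 2.2, 4.0])))
```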
4. Pitch score computation:
Fig. 6 is a schematic diagram of the pitch score computation. First, the pitch of the reference and test speech is extracted. Many pitch extraction methods exist in the prior art; weighing algorithmic complexity, robustness and estimation accuracy, an autocorrelation estimator based on linear predictive coding (LPC) analysis is adopted here. Then, combining the phoneme boundaries obtained in the matching score computation, the difference between the maximum and minimum pitch within each vowel, i.e. the pitch range of the vowel, is computed for both the reference and test speech:
$$S_{\mathrm{vow}}(i)=P_{\max}(i)-P_{\min}(i)\qquad(13)$$
where P_max(i) and P_min(i) are the maximum and minimum pitch within the i-th vowel.
The pitch score R_vow_max_min is defined as:
$$R_{\mathrm{vow\_max\_min}}=\frac{1}{N_v}\sum_{i=1}^{N_v}\bigl|S_{\mathrm{vow}}^{\mathrm{test}}(i)-S_{\mathrm{vow}}^{\mathrm{ref}}(i)\bigr|^{2}\qquad(14)$$
where N_v is the number of vowels in the sentence, and S_vow^test(i) and S_vow^ref(i) are the pitch ranges of the i-th vowel in the test and reference speech respectively.
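A simplified sketch of the pitch side: a plain autocorrelation pitch estimator (the patent's estimator is LPC-based; this stand-in is an assumption) followed by the pitch-range score of Eqs. (13)-(14):

```python
import numpy as np

def pitch_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate (Hz) for one voiced frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # plausible lag range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

def pitch_score(ranges_test, ranges_ref):
    """Eqs. (13)-(14): mean squared difference of per-vowel pitch ranges.

    ranges_*: arrays of S_vow(i) = P_max(i) - P_min(i), one per vowel.
    """
    ranges_test = np.asarray(ranges_test, dtype=float)
    ranges_ref = np.asarray(ranges_ref, dtype=float)
    return float(np.mean(np.abs(ranges_test - ranges_ref) ** 2))  # Eq. (14)
```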
5. Score mapping and score fusion:
Fig. 7 is a schematic diagram of score mapping and fusion. The machine scores are first mapped, and the mapped scores are then fused by linear weighting or by SVM to obtain the final objective score.
(1) Machine score mapping: after the matching, perceptual, duration and pitch scores have been computed, the four scores must first be mapped. Scores produced by different methods usually lie in different intervals, so a mapping function is needed to map each machine score into an interval consistent with the expert ratings. The mapping may use a Sigmoid function, a polynomial function or a linear function; the simplest effective choice is a cubic polynomial. The mapping is optimized under the minimum mean square error criterion, placing the machine scores in the expert rating interval:
$$y=a_1x^{3}+a_2x^{2}+a_3x+a_4\qquad(15)$$
where x is the raw machine score, y is the mapped machine score, and a_1, a_2, a_3 and a_4 are the polynomial coefficients.
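A sketch of the cubic mapping of Eq. (15); np.polyfit performs a least-squares fit, which matches the minimum mean square error criterion stated above. The development-set values shown are hypothetical:

```python
import numpy as np

def fit_score_mapping(machine_scores, expert_scores):
    """Eq. (15): fit a cubic mapping from raw machine scores to the
    expert rating scale by least squares (MMSE)."""
    coeffs = np.polyfit(machine_scores, expert_scores, deg=3)
    return np.poly1d(coeffs)   # callable y = a1*x^3 + a2*x^2 + a3*x + a4

# Hypothetical development data: raw matching scores vs. expert ratings
mapper = fit_score_mapping([0.4, 0.9, 1.3, 2.0, 2.6], [90, 80, 70, 55, 40])
print(mapper(1.1))             # mapped score for a new raw value
```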
(2) Score fusion: existing signal processing offers many information fusion methods, for example linear functions, neural networks, Gaussian mixture models, support vector machines, logistic regression, and other methods suitable for fusing several different scores. The invention mainly uses a linear function or a support vector machine to fuse the matching, perceptual, duration and pitch scores.
If the machine scores and the expert ratings can be regarded as jointly Gaussian random variables, or a linear relationship exists between them, the fused score can be expressed as a linear combination of the machine scores:
$$\hat{s}=a_1s_1+a_2s_2+\cdots+a_ns_n\qquad(16)$$
where s_1, s_2, …, s_n are the individual machine scores and a_1, a_2, …, a_n are combination coefficients, determined from development-set data according to the minimum mean square error criterion.
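A sketch of the linear fusion of Eq. (16), with the combination coefficients estimated on a development set by least squares (the MMSE criterion); appending a bias column to the score matrix is a common variant:

```python
import numpy as np

def fit_fusion_weights(S, y):
    """Eq. (16): MMSE estimate of the linear fusion coefficients.

    S: (num_utterances, 4) mapped matching, perceptual, duration and
       pitch scores from a development set
    y: (num_utterances,) expert ratings
    """
    a, *_ = np.linalg.lstsq(S, y, rcond=None)   # minimizes ||S a - y||^2
    return a

def fuse(scores, a):
    """Fused score of one utterance: a1*s1 + ... + a4*s4."""
    return float(np.dot(scores, a))
```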
General-purpose software tools are available for SVM fusion, whose performance is better than that of linear fusion. The SVM fusion method is well known to those skilled in the art, so its description is omitted.
In pronunciation quality evaluation, the performance of a method is expressed by the correlation coefficient between the scores obtained when a computer evaluates pronunciation quality automatically (commonly called machine scores) and the experts' ratings of the same utterances, as in Eq. (17). The higher the correlation, the closer the machine scores are to the expert ratings, and hence the better the performance.
$$C_{\mathrm{corr}}=\frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i(x_i-\bar{x})^{2}\,\sum_i(y_i-\bar{y})^{2}}}\qquad(17)$$
where x_i and y_i are the machine score and the corresponding expert rating of the i-th word or sentence, and x̄ and ȳ are the means of the machine scores and the expert ratings over all test speech.
This evaluation requires an evaluation speech corpus of a certain scale: experts first rate the utterances in the corpus subjectively, the machine then evaluates them, and the correlation between the computer evaluation and the expert evaluation is computed by Eq. (17). For words and short sentences, the correlation coefficient between the machine scores of the invention and the expert ratings reaches 0.800, outperforming traditional HMM-based evaluation methods.
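Eq. (17) is the ordinary Pearson correlation coefficient; a minimal sketch:

```python
import numpy as np

def correlation(machine, expert):
    """Eq. (17): Pearson correlation between machine and expert scores."""
    x = np.asarray(machine, dtype=float)
    y = np.asarray(expert, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2)))

# Equivalent to np.corrcoef(machine, expert)[0, 1]
```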

Claims (10)

1. A pronunciation quality evaluation method for a computer-assisted language learning system, comprising: matching score computation, auditory perceptual score computation based on the Mel frequency scale, duration score computation, pitch score computation, score mapping and score fusion, the computation comprising the following steps:
Step (1): the reference speech and the test speech are each split into frames, yielding short-time stationary frames;
Step (2): the matching likelihood scores of the framed reference and test speech of step (1) are computed by the following steps;
Step (2.1): Mel-frequency cepstral coefficient (MFCC) features, 39 dimensions in total, are extracted from every frame of the framed reference and test speech, comprising 12 MFCC coefficients with their first- and second-order differences, and the normalized energy with its first- and second-order differences;
Step (2.2): using a pre-trained hidden Markov model (HMM), the reference and test speech input from step (2.1) are each force-aligned with the Viterbi decoding algorithm, yielding the likelihoods of the reference and test speech and the boundary of each phoneme in the speech;
Step (2.3): the matching score L(i) of the i-th phoneme is computed according to the following formula:
$$L(i)=\bigl|\log p_{\mathrm{test}}(O_{\mathrm{test}}\mid q_i)-\log p_{\mathrm{ref}}(O_{\mathrm{ref}}\mid q_i)\bigr|$$
where p_test(O_test | q_i) is the likelihood of the test speech, p_ref(O_ref | q_i) is the likelihood of the reference speech, q_i denotes the HMM of the i-th phoneme, and O_test and O_ref are the MFCC feature vectors of the test and reference speech respectively;
Step (2.4): the average phoneme matching score is computed according to the following formula and taken as the matching score S_mat_sen of the utterance:
$$S_{\mathrm{mat\_sen}}=\frac{1}{N_p}\sum_{i=1}^{N_p}L(i)$$
where N_p is the total number of phonemes in the utterance;
Step (3): the perceptual scores of the framed reference and test speech of step (1) are computed by the following steps;
Step (3.1): the reference speech and the test speech are each split into frames and a Hanning window is applied;
Step (3.2): the framed speech of step (3.1) is passed through Q triangular filters uniformly distributed on the Mel frequency scale for Mel filtering, and the log energy M(q) of each filter output is obtained according to the following formula:
$$M(q)=\ln\!\left[\sum_{n=F_{q-1}}^{F_q}\frac{n-F_{q-1}}{F_q-F_{q-1}}G(n)+\sum_{n=F_q}^{F_{q+1}}\frac{F_{q+1}-n}{F_{q+1}-F_q}G(n)\right]$$
where F_q is the center frequency of the q-th triangular filter, F_{q+1} and F_{q-1} are its upper and lower cutoff frequencies, G(n) is the triangular window function, Q is the number of triangular filters, and q = 1, 2, 3, …, Q;
Step (3.3): the log energy M(q) of the q-th triangular filter obtained in step (3.2) is mapped to the loudness L(q) of the auditory perceptual domain:
$$L(q)=0.048\,M(q)^{0.6}$$
Step (3.4): based on the phoneme boundaries obtained in step (2.2), the dynamic time warping (DTW) algorithm is used to align the corresponding phonemes of the reference and test speech frame by frame at the phoneme level, and the loudness difference D(q) of the reference and test speech at the q-th triangular filter output is computed in the perceptual domain:
$$D(q)=L_{\mathrm{test}}(q)-L_{\mathrm{ref}}(q),\quad q=1,2,\dots,Q$$
where L_test(q) is the loudness of the test speech at the q-th triangular filter output and L_ref(q) is that of the reference speech;
Step (3.5): the loudness difference p_frame(j) of each frame is computed according to the following formula:
$$p_{\mathrm{frame}}(j)=\sqrt{\frac{\sum_{q=1}^{Q}\bigl(D(q)\,W(q)\bigr)^{2}}{\sum_{q=1}^{Q}W(q)}}$$
where W(q) is the bandwidth of the q-th triangular filter and Q is the number of triangular filters;
Step (3.6): the perceptual score p_phone(i) of the i-th phoneme, defined as the mean frame loudness difference between the reference and test speech, is computed as:
$$p_{\mathrm{phone}}(i)=\sqrt[6]{\frac{1}{N}\sum_{j=1}^{N}\bigl[p_{\mathrm{frame}}(j)\bigr]^{6}}$$
where N is the number of frames of the corresponding phoneme in the longer of the two utterances;
Step (3.7): the perceptual score p_{p_sen} of the whole utterance is computed as:
$$p_{p\_sen}=\frac{1}{N_p}\sum_{i=1}^{N_p}p_{\mathrm{phone}}(i)$$
where N_p is the total number of phonemes in the utterance;
Step (4): the duration score of the whole utterance is computed by the following steps:
Step (4.1): based on the phoneme boundaries obtained in step (2.2), a duration model is used to compute the duration probability score of every phoneme of the reference and test speech; the duration model is a histogram model or a Gamma model, learned in advance from a standard speech corpus;
Step (4.2): the phoneme duration score d_phone is computed as:
$$d_{\mathrm{phone}}=\bigl|\log D_{\mathrm{test}}-\log D_{\mathrm{ref}}\bigr|$$
where D_test is the duration probability score of the phoneme in the test speech and D_ref is that of the corresponding phoneme in the reference speech;
Step (4.3): the duration score d_sen of the whole utterance is computed as:
$$d_{\mathrm{sen}}=\frac{1}{N_p}\sum_{i=1}^{N_p}d_{\mathrm{phone}}(i)$$
where d_phone(i) is the log duration probability score of the i-th phoneme in the utterance;
Step (5): the pitch score of the whole utterance is computed by the following steps:
Step (5.1): combining the phoneme boundaries obtained in step (2.2) with an autocorrelation estimator based on linear predictive coding (LPC), the difference S_vow(i) between the maximum and minimum pitch within the i-th vowel is computed for the reference and test speech:
$$S_{\mathrm{vow}}(i)=P_{\max}(i)-P_{\min}(i)$$
where P_max(i) and P_min(i) are the maximum and minimum pitch within the i-th vowel;
Step (5.2): the pitch score R_vow_max_min is computed as:
$$R_{\mathrm{vow\_max\_min}}=\frac{1}{N_v}\sum_{i=1}^{N_v}\bigl|S_{\mathrm{vow}}^{\mathrm{test}}(i)-S_{\mathrm{vow}}^{\mathrm{ref}}(i)\bigr|^{2}$$
where N_v is the number of vowels in the sentence, and S_vow^test(i) and S_vow^ref(i) are the pitch ranges of the i-th vowel in the test and reference speech respectively;
Step (6): the fused pronunciation quality score of the whole utterance is computed by the following steps, the scores being fused comprising the matching score, the perceptual score, the duration score and the pitch score:
Step (6.1): the raw machine scores are mapped into the expert rating interval by a mapping function; the mapped machine score is computed as:
$$y=a_1x^{3}+a_2x^{2}+a_3x+a_4$$
where x is the raw machine score, y is the mapped machine score, and a_1, a_2, a_3 and a_4 are the polynomial coefficients;
Step (6.2): the fused pronunciation quality score ŝ of the whole utterance is computed by linear fusion as:
$$\hat{s}=a_1s_1+a_2s_2+\cdots+a_ns_n$$
where s_1, s_2, …, s_n are the individual machine scores and a_1, a_2, …, a_n are combination coefficients;
If support vector machine (SVM) fusion is adopted instead, a general-purpose SVM software toolkit can be used to compute the fused score; SVM-based fusion performs better than linear fusion.
2. The pronunciation quality evaluation method for a computer-assisted language learning system according to claim 1, characterized in that a conventional HMM method is used for time alignment and matching score computation: the Viterbi decoding algorithm force-aligns the reference speech and the test speech separately, and the resulting alignment information of the reference and test speech comprises state boundaries, phoneme boundaries and word boundaries.
3. The pronunciation quality evaluation method according to claim 1, characterized in that an auditory perceptual score computation based on the Mel frequency scale is used, which differs from the traditional perceptual score computation based on critical bands; the new method has lower computational complexity and better performance than the critical-band method.
4. The pronunciation quality evaluation method according to claim 1, characterized in that the teacher's reference speech is used as the reference template for pronunciation quality evaluation; unlike the traditional HMM score matching method based on a large training speech corpus, this makes full use of the information in the teacher's reference speech and helps evaluate higher-level pronunciation information.
5. The pronunciation quality evaluation method according to claim 1, characterized in that, starting from the phoneme boundary alignment obtained during the matching score computation, dynamic time warping (DTW) further aligns the reference and test speech frame by frame at the phoneme level, so that the aligned speech is better comparable frame by frame.
6. The pronunciation quality evaluation method according to claim 1, characterized in that the duration model is a histogram model or a Gamma model of phoneme durations, and the duration score is obtained from the absolute difference between the duration probabilities of the test speech and the reference (standard) speech.
7. The pronunciation quality evaluation method according to claim 1, characterized in that the pitch score is computed from the differences between the per-vowel pitch ranges (maximum minus minimum pitch) of the reference and test speech.
8. The pronunciation quality evaluation method according to claim 1, characterized in that multiple machine scores of the utterance are used, and the matching score, the perceptual score, the pitch score and the duration score are each mapped with one of a Sigmoid function, a polynomial function or a linear function, so that the mapped scores lie in the same interval as the expert ratings.
9. The pronunciation quality evaluation method according to claim 1, characterized in that the mapped matching, perceptual, pitch and duration scores are fused, using one of linear fusion, support vector machines (SVM), logistic regression, neural networks or Gaussian mixture models.
10. The pronunciation quality evaluation method according to claim 1, characterized in that those skilled in the art may make minor modifications and variations to the calculation steps described in claim 1; such modifications and variations that do not depart from the spirit and scope of the invention are also covered by the invention.
CN200810102076XA 2008-03-17 2008-03-17 Pronunciation quality evaluation method of computer auxiliary language learning system Active CN101246685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810102076XA CN101246685B (en) 2008-03-17 2008-03-17 Pronunciation quality evaluation method of computer auxiliary language learning system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810102076XA CN101246685B (en) 2008-03-17 2008-03-17 Pronunciation quality evaluation method of computer auxiliary language learning system

Publications (2)

Publication Number Publication Date
CN101246685A true CN101246685A (en) 2008-08-20
CN101246685B CN101246685B (en) 2011-03-30

Family

ID=39947102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810102076XA Active CN101246685B (en) 2008-03-17 2008-03-17 Pronunciation quality evaluation method of computer auxiliary language learning system

Country Status (1)

Country Link
CN (1) CN101246685B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103985391A (en) * 2014-04-16 2014-08-13 柳超 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation
CN103985392A (en) * 2014-04-16 2014-08-13 柳超 Phoneme-level low-power consumption spoken language assessment and defect diagnosis method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7962327B2 (en) * 2004-12-17 2011-06-14 Industrial Technology Research Institute Pronunciation assessment method and system based on distinctive feature analysis
CN100411011C (en) * 2005-11-18 2008-08-13 清华大学 Pronunciation quality evaluating method for language learning machine
CN1787070B (en) * 2005-12-09 2011-03-16 北京凌声芯语音科技有限公司 On-chip system for language learner

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727903B (en) * 2008-10-29 2011-10-19 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems
CN101739869B (en) * 2008-11-19 2012-03-28 中国科学院自动化研究所 Priori knowledge-based pronunciation evaluation and diagnosis system
CN101650886B (en) * 2008-12-26 2011-05-18 中国科学院声学研究所 Method for automatically detecting reading errors of language learners
CN101894560B (en) * 2010-06-29 2012-08-15 上海大学 Reference source-free MP3 audio frequency definition objective evaluation method
CN101894560A (en) * 2010-06-29 2010-11-24 上海大学 Reference source-free MP3 audio frequency definition objective evaluation method
CN101996635A (en) * 2010-08-30 2011-03-30 清华大学 English pronunciation quality evaluation method based on accent highlight degree
CN103054586B (en) * 2012-12-17 2014-07-23 清华大学 Chinese speech automatic audiometric method based on Chinese speech audiometric dynamic word list
CN103054586A (en) * 2012-12-17 2013-04-24 清华大学 Chinese speech automatic audiometric method based on Chinese speech audiometric dynamic word list
CN103151042B (en) * 2013-01-23 2016-02-24 中国科学院深圳先进技术研究院 Full-automatic oral evaluation management and points-scoring system and methods of marking thereof
CN103151042A (en) * 2013-01-23 2013-06-12 中国科学院深圳先进技术研究院 Full-automatic oral language evaluating management and scoring system and scoring method thereof
CN104599680A (en) * 2013-10-30 2015-05-06 语冠信息技术(上海)有限公司 Real-time spoken language evaluation system and real-time spoken language evaluation method on mobile equipment
CN104599680B (en) * 2013-10-30 2019-11-26 语冠信息技术(上海)有限公司 Real-time spoken evaluation system and method in mobile device
CN109496334B (en) * 2016-08-09 2022-03-11 华为技术有限公司 Apparatus and method for evaluating speech quality
CN109496334A (en) * 2016-08-09 2019-03-19 华为技术有限公司 For assessing the device and method of voice quality
CN106531185A (en) * 2016-11-01 2017-03-22 上海语知义信息技术有限公司 Voice evaluation method and system based on voice similarity
CN106935236A (en) * 2017-02-14 2017-07-07 复旦大学 A kind of piano performance appraisal procedure and system
CN106971703A (en) * 2017-03-17 2017-07-21 西北师范大学 A kind of song synthetic method and device based on HMM
CN107221343A (en) * 2017-05-19 2017-09-29 北京市农林科学院 The appraisal procedure and assessment system of a kind of quality of data
CN109686383A (en) * 2017-10-18 2019-04-26 腾讯科技(深圳)有限公司 A kind of speech analysis method, device and storage medium
CN109686383B (en) * 2017-10-18 2021-03-23 腾讯科技(深圳)有限公司 Voice analysis method, device and storage medium
CN109697988A (en) * 2017-10-20 2019-04-30 深圳市鹰硕音频科技有限公司 A kind of Speech Assessment Methods and device
CN109697988B (en) * 2017-10-20 2021-05-14 深圳市鹰硕教育服务有限公司 Voice evaluation method and device
CN109979486A (en) * 2017-12-28 2019-07-05 中国移动通信集团北京有限公司 A kind of speech quality assessment method and device
CN109979486B (en) * 2017-12-28 2021-07-09 中国移动通信集团北京有限公司 Voice quality assessment method and device
CN108877839A (en) * 2018-08-02 2018-11-23 南京华苏科技有限公司 The method and system of perceptual evaluation of speech quality based on voice semantics recognition technology
CN108877839B (en) * 2018-08-02 2021-01-12 南京华苏科技有限公司 Voice quality perception evaluation method and system based on voice semantic recognition technology
CN111640452A (en) * 2019-03-01 2020-09-08 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111640452B (en) * 2019-03-01 2024-05-07 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN110047474A (en) * 2019-05-06 2019-07-23 齐鲁工业大学 A kind of English phonetic pronunciation intelligent training system and training method
CN111859681A (en) * 2020-07-24 2020-10-30 重庆大学 Linear structure damage identification method based on ARFIMA model
CN111859681B (en) * 2020-07-24 2023-10-03 重庆大学 Linear structure damage identification method based on ARFIMA model
CN112017694A (en) * 2020-08-25 2020-12-01 天津洪恩完美未来教育科技有限公司 Voice data evaluation method and device, storage medium and electronic device
CN112017694B (en) * 2020-08-25 2021-08-20 天津洪恩完美未来教育科技有限公司 Voice data evaluation method and device, storage medium and electronic device
CN113571043A (en) * 2021-07-27 2021-10-29 广州欢城文化传媒有限公司 Dialect simulation force evaluation method and device, electronic equipment and storage medium
CN115662242A (en) * 2022-12-02 2023-01-31 首都医科大学附属北京儿童医院 Apparatus, device and storage medium for training children's language fluency
CN115798519A (en) * 2023-02-10 2023-03-14 山东山大鸥玛软件股份有限公司 English multi-question spoken language pronunciation assessment method and system

Also Published As

Publication number Publication date
CN101246685B (en) 2011-03-30

Similar Documents

Publication Publication Date Title
CN101246685B (en) Pronunciation quality evaluation method of computer auxiliary language learning system
Shobaki et al. The OGI kids’ speech corpus and recognizers
CN100411011C (en) Pronunciation quality evaluating method for language learning machine
US9672816B1 (en) Annotating maps with user-contributed pronunciations
CN101645271B (en) Rapid confidence-calculation method in pronunciation quality evaluation system
Deshwal et al. Feature extraction methods in language identification: a survey
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
WO2019214047A1 (en) Method and apparatus for establishing voice print model, computer device, and storage medium
CN104575490A (en) Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm
CN103559892A (en) Method and system for evaluating spoken language
CN102214462A (en) Method and system for estimating pronunciation
CN107886968B (en) Voice evaluation method and system
Fukuda et al. Detecting breathing sounds in realistic Japanese telephone conversations and its application to automatic speech recognition
Mary et al. Searching speech databases: features, techniques and evaluation measures
CN104575495A (en) Language identification method and system adopting total variable quantity factors
Lin et al. Improving L2 English rhythm evaluation with automatic sentence stress detection
Shrawankar et al. Speech user interface for computer based education system
Khan et al. Automatic Arabic pronunciation scoring for computer aided language learning
Mengistu Automatic text independent amharic language speaker recognition in noisy environment using hybrid approaches of LPCC, MFCC and GFCC
Knill et al. Use of graphemic lexicons for spoken language assessment
Kiefte et al. Modeling consonant-context effects in a large database of spontaneous speech recordings
Luo et al. Automatic pronunciation evaluation of language learners' utterances generated through shadowing.
CN112767961B (en) Accent correction method based on cloud computing
Kasahara et al. Improved and robust prediction of pronunciation distance for individual-basis clustering of World Englishes pronunciation
Jambi et al. Speak-Correct: A Computerized Interface for the Analysis of Mispronounced Errors.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20181114

Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030

Patentee after: Beijing Huacong Zhijia Technology Co., Ltd.

Address before: 100084 mailbox 100084-82, Beijing City

Patentee before: Tsinghua University

TR01 Transfer of patent right