CN101246685A - Pronunciation quality evaluation method of computer auxiliary language learning system - Google Patents

Pronunciation quality evaluation method of computer auxiliary language learning system

Info

Publication number
CN101246685A
CN101246685A (application number CNA200810102076XA)
Authority
CN
China
Prior art keywords
score
phoneme
test speech
speech
reference speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA200810102076XA
Other languages
Chinese (zh)
Other versions
CN101246685B (en)
Inventor
刘加
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Huacong Zhijia Technology Co Ltd
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CN200810102076XA priority Critical patent/CN101246685B/en
Publication of CN101246685A publication Critical patent/CN101246685A/en
Application granted granted Critical
Publication of CN101246685B publication Critical patent/CN101246685B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The present invention belongs to the field of voice technology. The pronunciation quality evaluation method for a computer-assisted language learning system comprises: computing a matching score, computing a perceptual score based on the Mel frequency scale, computing a duration score and computing a pitch score, then mapping these scores and fusing the mapped scores. The method is robust and correlates highly with expert ratings, and can be used for interactive language learning and automatic spoken-language testing.

Description

Pronunciation quality evaluation method for a computer-assisted language learning system
Technical field
The invention belongs to the field of voice technology. Specifically, it relates to a method of using speech processing technology to evaluate pronunciation quality in a computer-assisted language learning (CALL) system.
Background technology
Reliable evaluation of a learner's pronunciation quality is the core function of a CALL system. Owing to the limitations of the prior art, however, the performance of existing pronunciation quality evaluation methods is far from ideal and still some distance from practical use.
At present, computer-based pronunciation quality evaluation is mainly based on hidden Markov models (HMM). Chinese invention patent application No. 200510114848.8 discloses a pronunciation quality evaluation method for a learning machine. That method trains standard pronunciation models with HMMs, searches for the optimal path, and computes a confidence score with which pronunciation quality is evaluated. It depends too heavily on HMM training and, among the many factors that influence pronunciation quality, evaluates only those related to the acoustic model. Its correlation with expert ratings is therefore insufficient: for words and short sentences, the correlation between machine scores and expert ratings is only 0.74.
Speech quality evaluation also arises in communication systems. ITU-T Recommendation P.862 discloses a speech quality assessment method for telephone channels. The reference speech is first passed through the telephone channel to obtain the test speech. Both signals are then mapped to the perceptual domain, the delay of the test speech relative to the reference is estimated accurately, and finally the perceptual difference of the test speech relative to the reference is computed in the perceptual domain and used to evaluate the quality of the test speech.
However, speech quality assessment in communication systems differs from pronunciation quality evaluation in a CALL system. First, in a voice communication system the factors affecting speech quality are typically channel noise, codec-induced damage to the speech, and network delay. The reference and test speech are the same sentence spoken by the same speaker, so, delay aside, the phonemes of the test speech generally show no duration change; nor does the assessment care whether the speaker's pronunciation is correct. In a CALL system, by contrast, the factors affecting pronunciation quality are more complex. The distortion of a learner's test speech is genuinely caused by mispronunciation and has little to do with noise. If the teacher's speech is taken as the reference and the learner's speech as the test speech, the pronunciation quality of the test speech must be assessed relative to the reference. Because the reference and test speech come from different speakers, the two utterances differ in length, and this difference is not caused by delay, so the signals cannot be aligned directly. Second, different speakers have different vocal tract lengths, so the formants of the same phoneme are not identical in the test and reference speech. In addition, the prosody of the two speakers differs, which manifests directly as different stress patterns in the test and reference speech; their pitch also differs, and so do their pitch contours.
A CALL system should imitate, as far as possible, the process by which an expert evaluates pronunciation quality. That process can usually be divided into three steps. First, the expert listens to the test speech through earphones or a loudspeaker. Next, the brain processes the perceived speech and, drawing on phonetic and linguistic knowledge, compares the reference and test speech to find mispronunciations and distortions at every level (for example, the acoustic and prosodic levels). Finally, the expert combines these distortions into an overall assessment of the test speech. Perception of speech quality is thus closely related to the result of pronunciation quality evaluation. Yet the prior art fails to evaluate pronunciation quality at the prosodic level and lacks research into the perceptual-distortion aspect of pronunciation quality.
Addressing these problems of the prior art, the present invention proposes a pronunciation quality evaluation method for a CALL system. Taking the teacher's speech as the reference, the method computes the quality differences of the learner's test speech relative to the reference in the acoustic, perceptual and prosodic domains, obtaining a matching score, a perceptual score, a duration score and a pitch score, and fuses the four scores into a final score for the test speech. For words and short sentences, the correlation coefficient between the scores produced by the invention and expert ratings reaches 0.800, outperforming prior-art methods.
Summary of the invention
The correlation between machine scores and expert ratings obtained by current HMM-based pronunciation quality evaluation methods is not high enough to satisfy the requirements of today's computer-assisted language learning systems. The object of the invention is to overcome these deficiencies of the prior art by proposing a pronunciation quality evaluation method for computer-assisted language learning systems. The invention computes a matching score, a perceptual score, a duration score and a pitch score from the teacher's reference speech and the student's test speech, covering the acoustic, perceptual and prosodic domains, and scores pronunciation quality by mapping and then fusing these scores; for words and short sentences, the correlation between the machine pronunciation quality scores and expert subjective quality ratings can reach 0.800.
The pronunciation quality evaluation method in a computer-assisted language learning system proposed by the invention mainly comprises: computing the matching score, computing the perceptual score based on the Mel frequency scale, computing the pitch score and computing the duration score; mapping the matching, perceptual, pitch and duration scores; and fusing the mapped scores. The method makes full use of multiple sources of pronunciation information to obtain a reliable fused score with which the student's pronunciation quality is evaluated (scored). Each part comprises the following steps:
1. The matching score, the perceptual score based on the Mel frequency scale, the pitch score and the duration score are computed as follows:
(A) The matching score is computed by: (1) using the Viterbi decoding algorithm to force-align the test speech and the reference speech separately, obtaining the phoneme boundaries and the likelihoods of both; (2) deriving the matching score from the difference between the likelihoods of the test speech and the reference (standard) speech.
(B) The perceptual score is computed by: (1) applying Mel filters to the test and reference speech; (2) mapping the Mel filter output energies of both to loudness according to the power law; (3) using the phoneme boundary information to further refine the frame-by-frame alignment of the reference and test speech at the phoneme level by dynamic time warping (DTW); (4) computing the perceptual score from the per-frame loudness differences between the reference and test speech.
(C) The duration score is computed by: (1) using the boundary information and a duration model to compute the log duration probabilities of the test speech and the reference (standard) speech; (2) deriving the duration score from the absolute difference between the duration probabilities of the two.
(D) The pitch score is computed by: (1) extracting the pitch of the reference and test speech separately; (2) using the boundary information to find the maximum and minimum pitch within each vowel of the reference and test speech and computing their difference for each vowel; (3) deriving the pitch score from the differences between these per-vowel pitch ranges in the reference and test speech.
2. Score mapping: the matching score, the perceptual score, the pitch score and the duration score are each mapped with one of a Sigmoid function, a polynomial function or a linear function, so that the mapped scores lie in the same interval as the expert ratings.
3. Score fusion: the mapped matching, perceptual, pitch and duration scores are fused with one of linear fusion, support vector machines (SVM), logistic regression, neural networks or Gaussian mixture models; through the complementarity of the multiple information sources, the fused score is brought closer to the expert rating.
The Viterbi decoding algorithm force-aligns the test speech and the reference (standard) speech against hidden Markov models (HMM) trained on a large speech database. The resulting alignment of the reference and test speech may consist of phoneme boundaries, state boundaries or word boundaries; the phoneme boundaries are the ones used here. The HMM parameters (including means and variances) are estimated according to the maximum likelihood criterion.
The duration model is a histogram model or a Gamma model of phoneme durations, trained on a large speech database.
The frame-by-frame refined alignment starts from the phoneme boundaries obtained during the matching score computation and further aligns the reference and test speech frame by frame at the phoneme level using dynamic time warping (DTW), so that the aligned speech is better comparable frame by frame.
When evaluating a learner's pronunciation quality, the pronunciation quality evaluation method proposed by the invention outperforms the prior art. It is robust and correlates highly with expert ratings, and can be used for pronunciation quality evaluation in interactive language learning and in automatic spoken-language testing systems.
The present invention has following advantage:
(1) The invention makes full use of the pronunciation differences between the teacher's reference speech and the student's test speech;
(2) The proposed perceptual score computation based on the Mel frequency scale has lower computational complexity than perceptual score computation based on critical bands, and performs better;
(3) The invention makes full use of multiple sources of evaluation information in the pronunciation (matching, perceptual, duration and pitch information) and fuses them, so that the different scores complement one another, improving both the robustness of the evaluation and its correlation with expert ratings;
(4) The pronunciation evaluation method can also be applied to learning multiple languages; it is robust, correlates highly with expert ratings, and can be implemented on current handheld PCs, personal digital assistants (PDAs) or learning machines, so its range of application is very wide.
Description of drawings
Fig. 1 is an overview of the pronunciation quality evaluation method;
Fig. 2 is a schematic diagram of the matching score computation;
Fig. 3 shows the HMM topology;
Fig. 4 is a schematic diagram of the perceptual score computation;
Fig. 5 is a schematic diagram of the duration model;
Fig. 6 is a schematic diagram of the pitch score computation;
Fig. 7 is a schematic diagram of machine score fusion.
Embodiment
An embodiment of the pronunciation quality evaluation method for computer-assisted language learning proposed by the invention is described in detail below with reference to the drawings. Fig. 1 is the overall flowchart of the method. (1) The reference speech and the test speech are first passed through the acoustic model, the perceptual model, the duration model and the pitch model to compute the matching score, the perceptual score, the duration score and the pitch score respectively. (2) These scores, which describe pronunciation quality in the acoustic, perceptual and prosodic domains respectively, are then fused. (3) The fused score is used to evaluate the pronunciation quality of the test speech.
The reference speech is the teacher's standard pronunciation and serves as the benchmark of the evaluation; the test speech is the learner's speech, the object being evaluated. The method therefore computes the pronunciation quality difference of the test speech relative to the reference speech. The computation in this embodiment proceeds as follows:
1. Matching score computation:
Fig. 2 is a schematic diagram of the matching score computation. The reference and test speech are first split into frames, yielding short-time stationary frames, and Mel-frequency cepstral coefficient (MFCC) features are extracted from every frame. Each frame yields 39 dimensions: 12 MFCC coefficients with their first- and second-order differences, plus the normalized energy with its first- and second-order differences. The MFCC features reflect the static characteristics of the speech, while the first- and second-order difference coefficients reflect its dynamics. A trained hidden Markov model (HMM) is then used with the Viterbi decoding algorithm to force-align the reference and test speech separately, yielding the likelihood scores and the phoneme boundaries of both. HMM training is well known to those skilled in the art and is only sketched here. The HMM uses a left-to-right state transition topology, which describes the pronunciation characteristics of speech well; for example, a 3-state HMM may be used, with the topology shown in Fig. 3, where q_i denotes an HMM state, a_ij a state transition probability, and b_j(O_t) the multi-stream Gaussian mixture output density of the HMM state, as in Eq. (1):
$$b_j(O_t)=\prod_{s=1}^{S}\left[\sum_{m=1}^{M_s}c_{jsm}\,N(O_{st};\mu_{jsm},\varphi_{jsm})\right]^{\gamma_s}\qquad(1)$$
where S is the number of data streams, M_s is the number of Gaussian mixture components in stream s, γ_s is the stream weight, and N is a multivariate Gaussian density, as in Eq. (2):
$$N(o;\mu,\varphi)=\frac{1}{\sqrt{(2\pi)^{n}|\varphi|}}\,e^{-\frac{1}{2}(o-\mu)^{T}\varphi^{-1}(o-\mu)}\qquad(2)$$
Both the test speech and the reference speech consist of a sequence of phonemes. After the reference and test speech have each been force-aligned, the matching score L(i) of the i-th phoneme is given by:
$$L(i)=\bigl|\log p_{\mathrm{test}}(O_{\mathrm{test}}\mid q_i)-\log p_{\mathrm{ref}}(O_{\mathrm{ref}}\mid q_i)\bigr|\qquad(3)$$
where p_test(O_test | q_i) is the likelihood of the test speech, p_ref(O_ref | q_i) is the likelihood of the reference speech, q_i denotes the HMM of the i-th phoneme, and O_test and O_ref are the MFCC feature vectors of the test and reference speech respectively.
The matching score of an utterance is defined as the average phoneme matching score:
$$S_{\mathrm{mat\_sen}}=\frac{1}{N_p}\sum_{i=1}^{N_p}L(i)\qquad(4)$$
where N_p is the total number of phonemes in the utterance and L(i) is the matching score of the i-th phoneme.
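For concreteness, the following is a minimal Python sketch of Eqs. (3)-(4), assuming the per-phoneme log-likelihoods have already been produced by HMM forced alignment; the array names and numeric values are hypothetical:

```python
import numpy as np

def matching_score(loglik_test, loglik_ref):
    """Eqs. (3)-(4): average absolute log-likelihood difference per phoneme.

    loglik_test, loglik_ref: per-phoneme log-likelihoods obtained from
    Viterbi forced alignment of the test and reference utterances
    against the same phoneme HMMs.
    """
    loglik_test = np.asarray(loglik_test, dtype=float)
    loglik_ref = np.asarray(loglik_ref, dtype=float)
    L = np.abs(loglik_test - loglik_ref)  # Eq. (3), one value per phoneme
    return L.mean()                       # Eq. (4), S_mat_sen

# Illustrative call for a 4-phoneme utterance:
print(matching_score([-52.1, -48.3, -60.0, -55.2],
                     [-50.0, -47.9, -58.4, -57.0]))
```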
2. Perceptual score computation:
Fig. 4 is a schematic diagram of the perceptual score computation. The reference and test speech are first split into frames and a Hanning window is applied. Each frame is then passed through triangular filters uniformly distributed on the Mel frequency scale, and the log energy M(q) of each filter output is obtained:
$$M(q)=\ln\!\left[\sum_{n=F_{q-1}}^{F_q}\frac{n-F_{q-1}}{F_q-F_{q-1}}G(n)+\sum_{n=F_q}^{F_{q+1}}\frac{F_{q+1}-n}{F_{q+1}-F_q}G(n)\right],\quad q=1,2,\dots,Q\qquad(5)$$
where F_q is the center frequency of the q-th triangular filter, F_{q+1} and F_{q-1} are its upper and lower cutoff frequencies, G(n) is the triangular window function, and Q is the number of triangular filters, typically Q = 20–26.
According to the power law of psychoacoustics, the log energy output by each triangular filter can be mapped into the loudness domain:
$$L(q)=0.048\,M(q)^{0.6}\qquad(6)$$
where M(q) is the log energy of the q-th filter output and L(q) is the corresponding loudness in the perceptual domain.
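A minimal sketch of Eqs. (5)-(6) for a single frame follows; converting the filter center frequencies to FFT-bin indices beforehand, and clipping negative log energies before applying the power law, are implementation assumptions rather than details given in the patent:

```python
import numpy as np

def mel_loudness(power_spectrum, centers):
    """Eqs. (5)-(6): triangular Mel filterbank log energies mapped to loudness.

    power_spectrum: one frame's power spectrum G(n)
    centers: FFT-bin indices F_0 .. F_{Q+1} of the filter centers,
             uniformly spaced on the Mel scale (assumed precomputed)
    """
    Q = len(centers) - 2
    n = np.arange(len(power_spectrum))
    L = np.zeros(Q)
    for q in range(1, Q + 1):
        lo, c, hi = centers[q - 1], centers[q], centers[q + 1]
        w = np.zeros(len(power_spectrum))
        up = (n >= lo) & (n <= c)
        down = (n > c) & (n <= hi)
        w[up] = (n[up] - lo) / (c - lo)        # rising edge of triangle
        w[down] = (hi - n[down]) / (hi - c)    # falling edge of triangle
        M = np.log(max(np.dot(w, power_spectrum), 1e-10))  # Eq. (5)
        L[q - 1] = 0.048 * max(M, 0.0) ** 0.6              # Eq. (6)
    return L
```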
Starting from the phoneme boundary alignment obtained during the matching score computation, dynamic time warping (DTW) is then used to refine the alignment of the reference and test speech frame by frame within each phoneme. The DTW method is well known to those skilled in the art.
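For concreteness, a minimal DTW sketch of this frame-level alignment follows; the Euclidean local distance and the standard three-step recursion are common choices, not specifics of the patent:

```python
import numpy as np

def dtw_align(a, b):
    """Align two feature sequences frame by frame.

    a: (Ta, Q) per-frame features of a phoneme in the test speech
    b: (Tb, Q) per-frame features of the same phoneme in the reference
    Returns the warping path as a list of (test_frame, ref_frame) pairs.
    """
    Ta, Tb = len(a), len(b)
    cost = np.full((Ta + 1, Tb + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j - 1],  # match
                                 cost[i - 1, j],      # skip test frame
                                 cost[i, j - 1])      # skip reference frame
    path, i, j = [], Ta, Tb                           # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```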
After every frame of the reference and test speech has been aligned by DTW, the loudness difference D(q) at each triangular filter output can be computed:
$$D(q)=L_{\mathrm{test}}(q)-L_{\mathrm{ref}}(q),\quad q=1,2,\dots,Q\qquad(7)$$
where L_test(q) and L_ref(q) are the loudness of the test and reference speech at the q-th triangular filter output.
Having obtained the loudness difference at each triangular filter output, the total loudness difference over the whole Mel band, i.e. the loudness difference of each frame, is computed next, as a weighted combination of the loudness differences of all filter outputs across the Mel band. The loudness difference p_frame(j) of the j-th frame of the reference and test speech is:
$$p_{\mathrm{frame}}(j)=\sqrt{\frac{\sum_{q=1}^{Q}\bigl(D(q)\,W(q)\bigr)^{2}}{\sum_{q=1}^{Q}W(q)}}\qquad(8)$$
where D(q) is the loudness difference between the reference and test speech in the q-th band and W(q) is the bandwidth of the q-th triangular filter.
The perceptual score of a phoneme is defined as the mean frame loudness difference between the reference and test speech:
$$p_{\mathrm{phone}}(i)=\sqrt[6]{\frac{1}{N}\sum_{j=1}^{N}\bigl[p_{\mathrm{frame}}(j)\bigr]^{6}}\qquad(9)$$
where N is the number of frames of the corresponding phoneme in the longer of the two utterances and p_frame(j) is the loudness difference of frame j. The perceptual score p_{p_sen} of a whole utterance is then the mean of the loudness differences of all its phonemes:
$$p_{p\_sen}=\frac{1}{N_p}\sum_{i=1}^{N_p}p_{\mathrm{phone}}(i)\qquad(10)$$
where N_p is the total number of phonemes in the utterance.
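Putting Eqs. (7)-(10) together, a sketch of the perceptual score for one phoneme and for the whole utterance; reading Eq. (8) as a bandwidth-weighted RMS over the Q bands is this sketch's interpretation, stated here as an assumption:

```python
import numpy as np

def phone_perceptual_score(L_test, L_ref, W):
    """Eqs. (7)-(9) for one phoneme.

    L_test, L_ref: (N, Q) per-frame loudness, already DTW-aligned
    W: (Q,) bandwidths of the Q triangular filters
    """
    D = L_test - L_ref                                  # Eq. (7)
    # Eq. (8): bandwidth-weighted frame loudness difference (assumed form)
    p_frame = np.sqrt(np.sum((D * W) ** 2, axis=1) / np.sum(W))
    N = len(p_frame)
    return (np.sum(p_frame ** 6) / N) ** (1.0 / 6.0)    # Eq. (9), L6 norm

def sentence_perceptual_score(phone_scores):
    """Eq. (10): mean of the per-phoneme perceptual scores."""
    return float(np.mean(phone_scores))
```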
3. Duration score computation:
Fig. 5 is a schematic diagram of the duration score computation. Based on the phoneme boundaries obtained during the matching score computation, a duration model is used to compute the duration probability score of every phoneme of the reference and test speech. The duration model may be a histogram model or a Gamma model; both are well known to those skilled in the art, so a detailed description is omitted.
The duration score d_phone of a phoneme is defined as the absolute difference of the log duration probability scores of the test and reference speech:
$$d_{\mathrm{phone}}=\bigl|\log D_{\mathrm{test}}-\log D_{\mathrm{ref}}\bigr|\qquad(11)$$
where D_test is the duration probability score of the phoneme in the test speech and D_ref is that of the corresponding phoneme in the reference speech.
The duration score d_sen of a whole utterance is defined as the mean of all phoneme duration scores:
$$d_{\mathrm{sen}}=\frac{1}{N_p}\sum_{i=1}^{N_p}d_{\mathrm{phone}}(i)\qquad(12)$$
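A sketch of Eqs. (11)-(12) under a Gamma duration model, assuming per-phoneme Gamma parameters trained in advance on a large corpus; all numeric values are illustrative:

```python
import numpy as np
from scipy.stats import gamma

def duration_score(dur_test, dur_ref, shape, scale):
    """Eqs. (11)-(12): utterance duration score from a Gamma duration model.

    dur_test, dur_ref: per-phoneme durations (frames) of the two utterances
    shape, scale: per-phoneme Gamma parameters (assumed pre-trained)
    """
    logp_test = gamma.logpdf(dur_test, a=shape, scale=scale)
    logp_ref = gamma.logpdf(dur_ref, a=shape, scale=scale)
    d_phone = np.abs(logp_test - logp_ref)   # Eq. (11)
    return d_phone.mean()                    # Eq. (12)

# Illustrative call for a 3-phoneme utterance:
print(duration_score(dur_test=np.array([8.0, 12.0, 20.0]),
                     dur_ref=np.array([10.0, 11.0, 25.0]),
                     shape=np.array([4.0, 5.0, 6.0]),
                     scale=np.array([2.5, 2.2, 4.0])))
```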
4. Pitch score computation:
Fig. 6 is a schematic diagram of the pitch score computation. First, the pitch of the reference and test speech is extracted. Many pitch extraction methods exist in the prior art; weighing algorithmic complexity, robustness and estimation accuracy, an autocorrelation estimator based on linear predictive coding (LPC) analysis is adopted here. Then, combining the phoneme boundaries obtained in the matching score computation, the difference between the maximum and minimum pitch within each vowel, i.e. the pitch range of the vowel, is computed for both the reference and test speech:
$$S_{\mathrm{vow}}(i)=P_{\max}(i)-P_{\min}(i)\qquad(13)$$
where P_max(i) and P_min(i) are the maximum and minimum pitch within the i-th vowel.
The pitch score R_vow_max_min is defined as:
$$R_{\mathrm{vow\_max\_min}}=\frac{1}{N_v}\sum_{i=1}^{N_v}\bigl|S_{\mathrm{vow}}^{\mathrm{test}}(i)-S_{\mathrm{vow}}^{\mathrm{ref}}(i)\bigr|^{2}\qquad(14)$$
where N_v is the number of vowels in the sentence, and S_vow^test(i) and S_vow^ref(i) are the pitch ranges of the i-th vowel in the test and reference speech respectively.
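A simplified sketch of the pitch side: a plain autocorrelation pitch estimator (the patent's estimator is LPC-based; this stand-in is an assumption) followed by the pitch-range score of Eqs. (13)-(14):

```python
import numpy as np

def pitch_autocorr(frame, fs, fmin=60.0, fmax=400.0):
    """Crude autocorrelation pitch estimate (Hz) for one voiced frame."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lo, hi = int(fs / fmax), int(fs / fmin)   # plausible lag range
    lag = lo + int(np.argmax(ac[lo:hi]))
    return fs / lag

def pitch_score(ranges_test, ranges_ref):
    """Eqs. (13)-(14): mean squared difference of per-vowel pitch ranges.

    ranges_*: arrays of S_vow(i) = P_max(i) - P_min(i), one per vowel.
    """
    ranges_test = np.asarray(ranges_test, dtype=float)
    ranges_ref = np.asarray(ranges_ref, dtype=float)
    return float(np.mean(np.abs(ranges_test - ranges_ref) ** 2))  # Eq. (14)
```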
5. Score mapping and score fusion:
Fig. 7 is a schematic diagram of score mapping and fusion. The machine scores are first mapped, and the mapped scores are then fused by linear weighting or by SVM to obtain the final objective score.
(1) Machine score mapping: after the matching, perceptual, duration and pitch scores have been computed, the four scores must first be mapped. Scores produced by different methods usually lie in different intervals, so a mapping function is needed to map each machine score into an interval consistent with the expert ratings. The mapping may use a Sigmoid function, a polynomial function or a linear function; the simplest effective choice is a cubic polynomial. The mapping is optimized under the minimum mean square error criterion, placing the machine scores in the expert rating interval:
$$y=a_1x^{3}+a_2x^{2}+a_3x+a_4\qquad(15)$$
where x is the raw machine score, y is the mapped machine score, and a_1, a_2, a_3 and a_4 are the polynomial coefficients.
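A sketch of the cubic mapping of Eq. (15); np.polyfit performs a least-squares fit, which matches the minimum mean square error criterion stated above. The development-set values shown are hypothetical:

```python
import numpy as np

def fit_score_mapping(machine_scores, expert_scores):
    """Eq. (15): fit a cubic mapping from raw machine scores to the
    expert rating scale by least squares (MMSE)."""
    coeffs = np.polyfit(machine_scores, expert_scores, deg=3)
    return np.poly1d(coeffs)   # callable y = a1*x^3 + a2*x^2 + a3*x + a4

# Hypothetical development data: raw matching scores vs. expert ratings
mapper = fit_score_mapping([0.4, 0.9, 1.3, 2.0, 2.6], [90, 80, 70, 55, 40])
print(mapper(1.1))             # mapped score for a new raw value
```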
(2) Score fusion: existing signal processing offers many information fusion methods, for example linear functions, neural networks, Gaussian mixture models, support vector machines, logistic regression, and other methods suitable for fusing several different scores. The invention mainly uses a linear function or a support vector machine to fuse the matching, perceptual, duration and pitch scores.
If the machine scores and the expert ratings can be regarded as jointly Gaussian random variables, or a linear relationship exists between them, the fused score can be expressed as a linear combination of the machine scores:
$$\hat{s}=a_1s_1+a_2s_2+\cdots+a_ns_n\qquad(16)$$
where s_1, s_2, …, s_n are the individual machine scores and a_1, a_2, …, a_n are combination coefficients, determined from development-set data according to the minimum mean square error criterion.
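A sketch of the linear fusion of Eq. (16), with the combination coefficients estimated on a development set by least squares (the MMSE criterion); appending a bias column to the score matrix is a common variant:

```python
import numpy as np

def fit_fusion_weights(S, y):
    """Eq. (16): MMSE estimate of the linear fusion coefficients.

    S: (num_utterances, 4) mapped matching, perceptual, duration and
       pitch scores from a development set
    y: (num_utterances,) expert ratings
    """
    a, *_ = np.linalg.lstsq(S, y, rcond=None)   # minimizes ||S a - y||^2
    return a

def fuse(scores, a):
    """Fused score of one utterance: a1*s1 + ... + a4*s4."""
    return float(np.dot(scores, a))
```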
General-purpose software tools are available for SVM fusion, whose performance is better than that of linear fusion. The SVM fusion method is well known to those skilled in the art, so its description is omitted.
In pronunciation quality evaluation, the performance of a method is expressed by the correlation coefficient between the scores obtained when a computer evaluates pronunciation quality automatically (commonly called machine scores) and the experts' ratings of the same utterances, as in Eq. (17). The higher the correlation, the closer the machine scores are to the expert ratings, and hence the better the performance.
$$C_{\mathrm{corr}}=\frac{\sum_i(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i(x_i-\bar{x})^{2}\,\sum_i(y_i-\bar{y})^{2}}}\qquad(17)$$
where x_i and y_i are the machine score and the corresponding expert rating of the i-th word or sentence, and x̄ and ȳ are the means of the machine scores and the expert ratings over all test speech.
This evaluation requires an evaluation speech corpus of a certain scale: experts first rate the utterances in the corpus subjectively, the machine then evaluates them, and the correlation between the computer evaluation and the expert evaluation is computed by Eq. (17). For words and short sentences, the correlation coefficient between the machine scores of the invention and the expert ratings reaches 0.800, outperforming traditional HMM-based evaluation methods.
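Eq. (17) is the ordinary Pearson correlation coefficient; a minimal sketch:

```python
import numpy as np

def correlation(machine, expert):
    """Eq. (17): Pearson correlation between machine and expert scores."""
    x = np.asarray(machine, dtype=float)
    y = np.asarray(expert, dtype=float)
    xm, ym = x - x.mean(), y - y.mean()
    return float(np.sum(xm * ym) / np.sqrt(np.sum(xm ** 2) * np.sum(ym ** 2)))

# Equivalent to np.corrcoef(machine, expert)[0, 1]
```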

Claims (10)

1. A pronunciation quality evaluation method for a computer-assisted language learning system, comprising: matching score computation, auditory perceptual score computation based on the Mel frequency scale, duration score computation, pitch score computation, score mapping and score fusion, the computation comprising the following steps:
Step (1): the reference speech and the test speech are each split into frames, yielding short-time stationary frames;
Step (2): the matching likelihood scores of the framed reference and test speech of step (1) are computed by the following steps;
Step (2.1): Mel-frequency cepstral coefficient (MFCC) features, 39 dimensions in total, are extracted from every frame of the framed reference and test speech, comprising 12 MFCC coefficients with their first- and second-order differences, and the normalized energy with its first- and second-order differences;
Step (2.2): using a pre-trained hidden Markov model (HMM), the reference and test speech input from step (2.1) are each force-aligned with the Viterbi decoding algorithm, yielding the likelihoods of the reference and test speech and the boundary of each phoneme in the speech;
Step (2.3): the matching score L(i) of the i-th phoneme is computed according to the following formula:
$$L(i)=\bigl|\log p_{\mathrm{test}}(O_{\mathrm{test}}\mid q_i)-\log p_{\mathrm{ref}}(O_{\mathrm{ref}}\mid q_i)\bigr|$$
where p_test(O_test | q_i) is the likelihood of the test speech, p_ref(O_ref | q_i) is the likelihood of the reference speech, q_i denotes the HMM of the i-th phoneme, and O_test and O_ref are the MFCC feature vectors of the test and reference speech respectively;
Step (2.4): the average phoneme matching score is computed according to the following formula and taken as the matching score S_mat_sen of the utterance:
$$S_{\mathrm{mat\_sen}}=\frac{1}{N_p}\sum_{i=1}^{N_p}L(i)$$
where N_p is the total number of phonemes in the utterance;
Step (3): the perceptual scores of the framed reference and test speech of step (1) are computed by the following steps;
Step (3.1): the reference speech and the test speech are each split into frames and a Hanning window is applied;
Step (3.2): the framed speech of step (3.1) is passed through Q triangular filters uniformly distributed on the Mel frequency scale for Mel filtering, and the log energy M(q) of each filter output is obtained according to the following formula:
$$M(q)=\ln\!\left[\sum_{n=F_{q-1}}^{F_q}\frac{n-F_{q-1}}{F_q-F_{q-1}}G(n)+\sum_{n=F_q}^{F_{q+1}}\frac{F_{q+1}-n}{F_{q+1}-F_q}G(n)\right]$$
where F_q is the center frequency of the q-th triangular filter, F_{q+1} and F_{q-1} are its upper and lower cutoff frequencies, G(n) is the triangular window function, Q is the number of triangular filters, and q = 1, 2, 3, …, Q;
Step (3.3): the log energy M(q) of the q-th triangular filter obtained in step (3.2) is mapped to the loudness L(q) of the auditory perceptual domain:
$$L(q)=0.048\,M(q)^{0.6}$$
Step (3.4): based on the phoneme boundaries obtained in step (2.2), the dynamic time warping (DTW) algorithm is used to align the corresponding phonemes of the reference and test speech frame by frame at the phoneme level, and the loudness difference D(q) of the reference and test speech at the q-th triangular filter output is computed in the perceptual domain:
$$D(q)=L_{\mathrm{test}}(q)-L_{\mathrm{ref}}(q),\quad q=1,2,\dots,Q$$
where L_test(q) is the loudness of the test speech at the q-th triangular filter output and L_ref(q) is that of the reference speech;
Step (3.5): the loudness difference p_frame(j) of each frame is computed according to the following formula:
$$p_{\mathrm{frame}}(j)=\sqrt{\frac{\sum_{q=1}^{Q}\bigl(D(q)\,W(q)\bigr)^{2}}{\sum_{q=1}^{Q}W(q)}}$$
where W(q) is the bandwidth of the q-th triangular filter and Q is the number of triangular filters;
Step (3.6): the perceptual score p_phone(i) of the i-th phoneme, defined as the mean frame loudness difference between the reference and test speech, is computed as:
$$p_{\mathrm{phone}}(i)=\sqrt[6]{\frac{1}{N}\sum_{j=1}^{N}\bigl[p_{\mathrm{frame}}(j)\bigr]^{6}}$$
where N is the number of frames of the corresponding phoneme in the longer of the two utterances;
Step (3.7): the perceptual score p_{p_sen} of the whole utterance is computed as:
$$p_{p\_sen}=\frac{1}{N_p}\sum_{i=1}^{N_p}p_{\mathrm{phone}}(i)$$
where N_p is the total number of phonemes in the utterance;
Step (4): the duration score of the whole utterance is computed by the following steps:
Step (4.1): based on the phoneme boundaries obtained in step (2.2), a duration model is used to compute the duration probability score of every phoneme of the reference and test speech; the duration model is a histogram model or a Gamma model, learned in advance from a standard speech corpus;
Step (4.2): the phoneme duration score d_phone is computed as:
$$d_{\mathrm{phone}}=\bigl|\log D_{\mathrm{test}}-\log D_{\mathrm{ref}}\bigr|$$
where D_test is the duration probability score of the phoneme in the test speech and D_ref is that of the corresponding phoneme in the reference speech;
Step (4.3): the duration score d_sen of the whole utterance is computed as:
$$d_{\mathrm{sen}}=\frac{1}{N_p}\sum_{i=1}^{N_p}d_{\mathrm{phone}}(i)$$
where d_phone(i) is the log duration probability score of the i-th phoneme in the utterance;
Step (5): the pitch score of the whole utterance is computed by the following steps:
Step (5.1): combining the phoneme boundaries obtained in step (2.2) with an autocorrelation estimator based on linear predictive coding (LPC), the difference S_vow(i) between the maximum and minimum pitch within the i-th vowel is computed for the reference and test speech:
$$S_{\mathrm{vow}}(i)=P_{\max}(i)-P_{\min}(i)$$
where P_max(i) and P_min(i) are the maximum and minimum pitch within the i-th vowel;
Step (5.2): the pitch score R_vow_max_min is computed as:
$$R_{\mathrm{vow\_max\_min}}=\frac{1}{N_v}\sum_{i=1}^{N_v}\bigl|S_{\mathrm{vow}}^{\mathrm{test}}(i)-S_{\mathrm{vow}}^{\mathrm{ref}}(i)\bigr|^{2}$$
where N_v is the number of vowels in the sentence, and S_vow^test(i) and S_vow^ref(i) are the pitch ranges of the i-th vowel in the test and reference speech respectively;
Step (6): the fused pronunciation quality score of the whole utterance is computed by the following steps, the scores being fused comprising the matching score, the perceptual score, the duration score and the pitch score:
Step (6.1): the raw machine scores are mapped into the expert rating interval by a mapping function; the mapped machine score is computed as:
$$y=a_1x^{3}+a_2x^{2}+a_3x+a_4$$
where x is the raw machine score, y is the mapped machine score, and a_1, a_2, a_3 and a_4 are the polynomial coefficients;
Step (6.2): the fused pronunciation quality score ŝ of the whole utterance is computed by linear fusion as:
$$\hat{s}=a_1s_1+a_2s_2+\cdots+a_ns_n$$
where s_1, s_2, …, s_n are the individual machine scores and a_1, a_2, …, a_n are combination coefficients;
If support vector machine (SVM) fusion is adopted instead, a general-purpose SVM software toolkit can be used to compute the fused score; SVM-based fusion performs better than linear fusion.
2. The pronunciation quality evaluation method for a computer-assisted language learning system according to claim 1, characterized in that a conventional HMM method is used for time alignment and matching score computation: the Viterbi decoding algorithm force-aligns the reference speech and the test speech separately, and the resulting alignment information of the reference and test speech comprises state boundaries, phoneme boundaries and word boundaries.
3. The pronunciation quality evaluation method according to claim 1, characterized in that an auditory perceptual score computation based on the Mel frequency scale is used, which differs from the traditional perceptual score computation based on critical bands; the new method has lower computational complexity and better performance than the critical-band method.
4. The pronunciation quality evaluation method according to claim 1, characterized in that the teacher's reference speech is used as the reference template for pronunciation quality evaluation; unlike the traditional HMM score matching method based on a large training speech corpus, this makes full use of the information in the teacher's reference speech and helps evaluate higher-level pronunciation information.
5. The pronunciation quality evaluation method according to claim 1, characterized in that, starting from the phoneme boundary alignment obtained during the matching score computation, dynamic time warping (DTW) further aligns the reference and test speech frame by frame at the phoneme level, so that the aligned speech is better comparable frame by frame.
6. The pronunciation quality evaluation method according to claim 1, characterized in that the duration model is a histogram model or a Gamma model of phoneme durations, and the duration score is obtained from the absolute difference between the duration probabilities of the test speech and the reference (standard) speech.
7. The pronunciation quality evaluation method according to claim 1, characterized in that the pitch score is computed from the differences between the per-vowel pitch ranges (maximum minus minimum pitch) of the reference and test speech.
8. The pronunciation quality evaluation method according to claim 1, characterized in that multiple machine scores of the utterance are used, and the matching score, the perceptual score, the pitch score and the duration score are each mapped with one of a Sigmoid function, a polynomial function or a linear function, so that the mapped scores lie in the same interval as the expert ratings.
9. The pronunciation quality evaluation method according to claim 1, characterized in that the mapped matching, perceptual, pitch and duration scores are fused, using one of linear fusion, support vector machines (SVM), logistic regression, neural networks or Gaussian mixture models.
10. The pronunciation quality evaluation method according to claim 1, characterized in that those skilled in the art may make minor modifications and variations to the calculation steps described in claim 1; such modifications and variations that do not depart from the spirit and scope of the invention are also covered by the invention.
CN200810102076XA 2008-03-17 2008-03-17 Pronunciation quality evaluation method of computer auxiliary language learning system Active CN101246685B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200810102076XA CN101246685B (en) 2008-03-17 2008-03-17 Pronunciation quality evaluation method of computer auxiliary language learning system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200810102076XA CN101246685B (en) 2008-03-17 2008-03-17 Pronunciation quality evaluation method of computer auxiliary language learning system

Publications (2)

Publication Number Publication Date
CN101246685A true CN101246685A (en) 2008-08-20
CN101246685B CN101246685B (en) 2011-03-30

Family

ID=39947102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200810102076XA Active CN101246685B (en) 2008-03-17 2008-03-17 Pronunciation quality evaluation method of computer auxiliary language learning system

Country Status (1)

Country Link
CN (1) CN101246685B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103985391A (en) * 2014-04-16 2014-08-13 柳超 Phonetic-level low power consumption spoken language evaluation and defect diagnosis method without standard pronunciation
CN103985392A (en) * 2014-04-16 2014-08-13 柳超 Phoneme-level low-power consumption spoken language assessment and defect diagnosis method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7962327B2 (en) * 2004-12-17 2011-06-14 Industrial Technology Research Institute Pronunciation assessment method and system based on distinctive feature analysis
CN100411011C (en) * 2005-11-18 2008-08-13 清华大学 Pronunciation quality evaluating method for language learning machine
CN1787070B (en) * 2005-12-09 2011-03-16 北京凌声芯语音科技有限公司 On-chip system for language learner

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101727903B (en) * 2008-10-29 2011-10-19 中国科学院自动化研究所 Pronunciation quality assessment and error detection method based on fusion of multiple characteristics and multiple systems
CN101739869B (en) * 2008-11-19 2012-03-28 中国科学院自动化研究所 Priori knowledge-based pronunciation evaluation and diagnosis system
CN101650886B (en) * 2008-12-26 2011-05-18 中国科学院声学研究所 Method for automatically detecting reading errors of language learners
CN101894560B (en) * 2010-06-29 2012-08-15 上海大学 Reference source-free MP3 audio frequency definition objective evaluation method
CN101894560A (en) * 2010-06-29 2010-11-24 上海大学 Reference source-free MP3 audio frequency definition objective evaluation method
CN101996635A (en) * 2010-08-30 2011-03-30 清华大学 English pronunciation quality evaluation method based on accent highlight degree
CN103054586B (en) * 2012-12-17 2014-07-23 清华大学 Chinese speech automatic audiometric method based on Chinese speech audiometric dynamic word list
CN103054586A (en) * 2012-12-17 2013-04-24 清华大学 Chinese speech automatic audiometric method based on Chinese speech audiometric dynamic word list
CN103151042B (en) * 2013-01-23 2016-02-24 中国科学院深圳先进技术研究院 Full-automatic oral evaluation management and points-scoring system and methods of marking thereof
CN103151042A (en) * 2013-01-23 2013-06-12 中国科学院深圳先进技术研究院 Full-automatic oral language evaluating management and scoring system and scoring method thereof
CN104599680A (en) * 2013-10-30 2015-05-06 语冠信息技术(上海)有限公司 Real-time spoken language evaluation system and real-time spoken language evaluation method on mobile equipment
CN104599680B (en) * 2013-10-30 2019-11-26 语冠信息技术(上海)有限公司 Real-time spoken evaluation system and method in mobile device
CN109496334B (en) * 2016-08-09 2022-03-11 华为技术有限公司 Apparatus and method for evaluating speech quality
CN109496334A (en) * 2016-08-09 2019-03-19 华为技术有限公司 For assessing the device and method of voice quality
CN106531185A (en) * 2016-11-01 2017-03-22 上海语知义信息技术有限公司 Voice evaluation method and system based on voice similarity
CN106935236A (en) * 2017-02-14 2017-07-07 复旦大学 A kind of piano performance appraisal procedure and system
CN106971703A (en) * 2017-03-17 2017-07-21 西北师范大学 A kind of song synthetic method and device based on HMM
CN107221343A (en) * 2017-05-19 2017-09-29 北京市农林科学院 The appraisal procedure and assessment system of a kind of quality of data
CN109686383A (en) * 2017-10-18 2019-04-26 腾讯科技(深圳)有限公司 A kind of speech analysis method, device and storage medium
CN109686383B (en) * 2017-10-18 2021-03-23 腾讯科技(深圳)有限公司 Voice analysis method, device and storage medium
CN109697988A (en) * 2017-10-20 2019-04-30 深圳市鹰硕音频科技有限公司 A kind of Speech Assessment Methods and device
CN109697988B (en) * 2017-10-20 2021-05-14 深圳市鹰硕教育服务有限公司 Voice evaluation method and device
CN109979486A (en) * 2017-12-28 2019-07-05 中国移动通信集团北京有限公司 A kind of speech quality assessment method and device
CN109979486B (en) * 2017-12-28 2021-07-09 中国移动通信集团北京有限公司 Voice quality assessment method and device
CN108877839A (en) * 2018-08-02 2018-11-23 南京华苏科技有限公司 The method and system of perceptual evaluation of speech quality based on voice semantics recognition technology
CN108877839B (en) * 2018-08-02 2021-01-12 南京华苏科技有限公司 Voice quality perception evaluation method and system based on voice semantic recognition technology
CN111640452A (en) * 2019-03-01 2020-09-08 北京搜狗科技发展有限公司 Data processing method and device and data processing device
CN111640452B (en) * 2019-03-01 2024-05-07 北京搜狗科技发展有限公司 Data processing method and device for data processing
CN110047474A (en) * 2019-05-06 2019-07-23 齐鲁工业大学 A kind of English phonetic pronunciation intelligent training system and training method
CN111859681A (en) * 2020-07-24 2020-10-30 重庆大学 Linear structure damage identification method based on ARFIMA model
CN111859681B (en) * 2020-07-24 2023-10-03 重庆大学 Linear structure damage identification method based on ARFIMA model
CN112017694A (en) * 2020-08-25 2020-12-01 天津洪恩完美未来教育科技有限公司 Voice data evaluation method and device, storage medium and electronic device
CN112017694B (en) * 2020-08-25 2021-08-20 天津洪恩完美未来教育科技有限公司 Voice data evaluation method and device, storage medium and electronic device
CN113571043A (en) * 2021-07-27 2021-10-29 广州欢城文化传媒有限公司 Dialect simulation force evaluation method and device, electronic equipment and storage medium
CN115662242A (en) * 2022-12-02 2023-01-31 首都医科大学附属北京儿童医院 Apparatus, device and storage medium for training children's language fluency
CN115798519A (en) * 2023-02-10 2023-03-14 山东山大鸥玛软件股份有限公司 English multi-question spoken language pronunciation assessment method and system

Also Published As

Publication number Publication date
CN101246685B (en) 2011-03-30

Similar Documents

Publication Publication Date Title
CN101246685B (en) Pronunciation quality evaluation method of computer auxiliary language learning system
Shobaki et al. The OGI kids’ speech corpus and recognizers
CN100411011C (en) Pronunciation quality evaluating method for language learning machine
US9672816B1 (en) Annotating maps with user-contributed pronunciations
CN101645271B (en) Rapid confidence-calculation method in pronunciation quality evaluation system
Deshwal et al. Feature extraction methods in language identification: a survey
CN108922541B (en) Multi-dimensional characteristic parameter voiceprint recognition method based on DTW and GMM models
WO2019214047A1 (en) Method and apparatus for establishing voice print model, computer device, and storage medium
CN104575490A (en) Spoken language pronunciation detecting and evaluating method based on deep neural network posterior probability algorithm
CN103559892A (en) Method and system for evaluating spoken language
CN102214462A (en) Method and system for estimating pronunciation
CN107886968B (en) Voice evaluation method and system
Fukuda et al. Detecting breathing sounds in realistic Japanese telephone conversations and its application to automatic speech recognition
Mary et al. Searching speech databases: features, techniques and evaluation measures
CN104575495A (en) Language identification method and system adopting total variable quantity factors
Lin et al. Improving L2 English rhythm evaluation with automatic sentence stress detection
Shrawankar et al. Speech user interface for computer based education system
Khan et al. Automatic Arabic pronunciation scoring for computer aided language learning
Mengistu Automatic text independent amharic language speaker recognition in noisy environment using hybrid approaches of LPCC, MFCC and GFCC
Knill et al. Use of graphemic lexicons for spoken language assessment
Kiefte et al. Modeling consonant-context effects in a large database of spontaneous speech recordings
Luo et al. Automatic pronunciation evaluation of language learners' utterances generated through shadowing.
CN112767961B (en) Accent correction method based on cloud computing
Kasahara et al. Improved and robust prediction of pronunciation distance for individual-basis clustering of World Englishes pronunciation
Jambi et al. Speak-Correct: A Computerized Interface for the Analysis of Mispronounced Errors.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20181114

Address after: 100085 Beijing Haidian District Shangdi Information Industry Base Pioneer Road 1 B Block 2 Floor 2030

Patentee after: Beijing Huacong Zhijia Technology Co., Ltd.

Address before: 100084 mailbox 100084-82, Beijing City

Patentee before: Tsinghua University

TR01 Transfer of patent right