CN1763843A - Pronunciation quality evaluating method for language learning machine - Google Patents

Pronunciation quality evaluating method for language learning machine

Info

Publication number
CN1763843A
Authority
CN
China
Prior art keywords
model
voice
pronunciation
received pronunciation
phonetic feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CNA2005101148488A
Other languages
Chinese (zh)
Other versions
CN100411011C (en)
Inventor
梁维谦
董明
丁玉国
刘润生
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University filed Critical Tsinghua University
Priority to CNB2005101148488A priority Critical patent/CN100411011C/en
Publication of CN1763843A publication Critical patent/CN1763843A/en
Application granted granted Critical
Publication of CN100411011C publication Critical patent/CN100411011C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Electrically Operated Instructional Devices (AREA)

Abstract

The invention discloses a pronunciation quality evaluation method for a language learning machine, in the field of computer-assisted language learning and speech technology. The method comprises: extracting speech features for training; training a standard pronunciation model; generating a standard pronunciation network; detecting speech endpoints; extracting speech features for evaluation; searching for the optimal path; and calculating a pronunciation quality score. The method gives objective and stable evaluations, and supports an embedded English learning system offering interactive human-machine teaching and spoken-English self-assessment.

Description

Pronunciation quality evaluation method for a language learning machine
Technical field
The invention belongs to the field of computer-assisted language learning and speech technology, and in particular relates to a pronunciation quality evaluation method implemented on a 16-bit or wider digital signal processing chip.
Background technology
In recent years, embedded language learning products have developed rapidly at home and abroad. The earliest were language repeaters: add-on devices for analog tape recorders that digitized and stored a short segment of speech so it could be played back repeatedly, helping the learner listen, repeat, and memorize. The mainstream products today are second-generation devices built around digital signal processing (DSP) chips. Their hardware typically comprises a microcontroller (Micro Control Unit, MCU), a DSP chip, a codec (CODEC), ROM, SRAM, flash memory, a universal serial bus (USB) interface, a keyboard, and a liquid crystal display (LCD). The MCU serves as the main control chip, executing device drivers and system control programs such as the program scheduler, while the DSP runs the application algorithms. The applications include basic modules such as recording, playback, and speech-rate adjustment; some products also include an MP3 module. Functionally these machines offer repeat playback, read-along, read-along comparison, synchronized text display, content retrieval, and rate-adjustable playback, and most can download and update learning material over the Internet. The Haojixing (好记星) digital English learning machine from Shenzhen is a typical representative of the second generation.
The key to learning a language, and spoken language in particular, is interaction: the teacher gives timely, targeted feedback and guidance during the learning process. In traditional teacher-centered language learning this cannot be achieved for every learner because qualified teachers are in short supply, and existing language learning machines lack the ability to evaluate the learner's pronunciation.
Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art by proposing a pronunciation quality evaluation method for a language learning machine: a method that achieves high-performance pronunciation quality evaluation, independent of text and speaker, on an embedded language learning machine, with moderate complexity, high evaluation accuracy, and good robustness. In particular, its evaluation accuracy for speakers with a Chinese accent reaches or even exceeds the current international state of the art.
The pronunciation quality evaluation method for a language learning machine proposed by the invention comprises the following parts: speech feature extraction for training, standard pronunciation model training, standard pronunciation network generation, speech endpoint detection, speech feature extraction for evaluation, optimal path search, and pronunciation quality score calculation. It is characterized in that the implementation of each part comprises the following steps:
A. Speech feature extraction for training:
(1) Build in advance a training database containing a large amount of read speech;
(2) Apply pre-emphasis, framing, and windowing to the digital speech in each file of the training database, obtaining quasi-stationary speech frames;
(3) Extract speech features from the speech frames; the features are cepstral coefficients.
B. Standard pronunciation model training:
(1) Use the speech features of step A to train a phoneme-based standard pronunciation model;
(2) Adapt the standard pronunciation model to the accent of Chinese speakers to obtain the final standard pronunciation model, optimizing its evaluation performance for Chinese speakers.
C. Standard pronunciation network generation:
Segment the given text into words, look each word up in a pronunciation dictionary to obtain its phoneme transcription, and use the phoneme-based standard pronunciation model to build a linear standard pronunciation network whose nodes are HMM states.
D. Speech endpoint detection:
(1) Convert the analog speech signal to digital speech by A/D conversion;
(2) Apply pre-emphasis, framing, and windowing to the digital speech, obtaining quasi-stationary speech frames;
(3) Compute the time-domain log energy of the speech frames;
(4) Apply a moving-average filter to the time-domain log energy to obtain the feature used for endpoint detection (hereafter the endpoint feature);
(5) Apply upper/lower dual thresholds combined with a finite state machine to the endpoint feature to detect the start and end points of the speech.
E. Speech feature extraction for evaluation:
Extract speech features from the speech frames of step D, using the same process as step A(3).
F. Optimal path search:
(1) Force-align the speech features of step E with the standard pronunciation network of step C, obtaining all possible path information in the network;
(2) Using this path information, trace back from the permitted terminal node of the network to obtain the optimal path.
G. Pronunciation quality score calculation:
(1) Use the optimal path information from step F to compute a confidence score for each frame of speech features;
(2) Use the optimal path information from step F to compute the confidence score of each state on the path, and average the confidence scores of all states on the optimal path to obtain the sentence-level confidence score;
(3) Map the sentence-level confidence score into the subjective scoring range with a mapping function to obtain the final pronunciation quality score.
The cepstral coefficients in step A may be Mel-frequency cepstral coefficients (MFCC, Mel-Frequency Cepstral Coefficients), which exploit the frequency resolution characteristics of the human ear.
The standard pronunciation model of step B(1) is a phoneme-based hidden Markov model (HMM, Hidden Markov Model). Its training process is: initialize a single Gaussian model from all the speech features, copy this model to create all the phoneme models, and train the models repeatedly with the Baum-Welch algorithm; then progressively increase the number of Gaussian components of each phoneme model, performing further Baum-Welch training after each increase.
The accent adaptation of the standard pronunciation model to Chinese speakers in step B(2) is implemented by applying maximum likelihood linear regression (MLLR, Maximum Likelihood Linear Regression) and maximum a posteriori (MAP, Maximum A Posteriori) accent adaptation to the trained standard pronunciation model, yielding the final standard pronunciation model.
The standard pronunciation network of step C may be a linear network with definite start and end nodes whose nodes are HMM states, without grammar, in which each node depends only on its predecessor node.
The optimal path search of step F uses frame-synchronous Viterbi beam search.
To realize the pronunciation quality evaluation method of the invention within the limited memory of an embedded system, steps D, E, F, and G are performed in segments of a predefined fixed number of frames. This greatly reduces the demand on system resources and lets the embedded learning system handle long utterances.
The pronunciation quality evaluation method of the invention gives the language learning machine an interactive capability. An embedded English learning system implemented with this method has achieved good performance in practical use.
The invention has the following features:
(1) High evaluation accuracy, good robustness, and low system resource overhead;
(2) The phoneme-based standard pronunciation model lets the embedded learning system change courseware content easily, without retraining;
(3) The influence of the mother-tongue accent is taken into account: the phoneme models are accent-adapted to Chinese-accented English pronunciation;
(4) Real-time endpoint detection using a moving-average filter and a finite state machine improves the accuracy and robustness of English speech endpoint detection;
(5) The method suits DSP-based embedded language learning systems, which are small, lightweight, low-power, and low-cost;
(6) Combined with rich courseware, the pronunciation quality evaluation method of the invention can change the traditional learning machine mode of operation and the classroom teaching pattern.
Description of drawings
Fig. 1 is an overall flow diagram of the method of the embodiment of the invention.
Fig. 2 is a flow diagram of standard pronunciation model training in the embodiment; Fig. 2(a) shows the overall training process, and Fig. 2(b) the training of one specific hidden Markov model.
Fig. 3 shows the topology of the standard pronunciation model of the embodiment; Fig. 3(a) shows the pause model, and Fig. 3(b) the phoneme and silence models.
Fig. 4 is a flow diagram of hidden Markov model accent adaptation in the embodiment.
Fig. 5 is a schematic diagram of the standard pronunciation network topology of the embodiment; Fig. 5(a) shows the sentence as a linear network of word nodes, and Fig. 5(b) each word as a linear network of phoneme nodes.
Fig. 6 is a schematic diagram of the generation process of the standard pronunciation network of the embodiment.
Fig. 7 is a detailed flow chart of the pronunciation quality evaluation method of the embodiment as realized on an embedded platform.
Embodiment
An embodiment of the pronunciation quality evaluation method for a language learning machine proposed by the invention is described in detail below with reference to the figures:
As shown in Fig. 1, the overall flow of the embodiment is divided into: A, speech feature extraction for training; B, standard pronunciation model training; C, standard pronunciation network generation (these steps can be completed in advance on a computer); D, speech endpoint detection; E, speech feature extraction for evaluation; F, optimal path search; G, pronunciation quality score calculation and output (these steps are performed on the embedded platform). Each step is described in detail below.
A. Speech feature extraction for training:
(1) Build in advance a training database containing a large amount of read English speech (the content must give each phoneme a degree of coverage);
(2) Apply pre-emphasis to the digital speech in each file of the training database, with the pre-emphasis filter H(z) = 1 - 0.9375 z^{-1}; then frame and window the pre-emphasized speech (using a Hamming window), with a frame length of 32 ms and a frame shift of 16 ms, obtaining quasi-stationary speech frames;
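As a minimal sketch of step (2), the pre-emphasis filter and the 32 ms / 16 ms Hamming-windowed framing can be written in Python (NumPy); the 8 kHz sampling rate matches the endpoint-detection section, and the rest follows the parameters given above:

```python
import numpy as np

def preemphasize(x, a=0.9375):
    """Apply H(z) = 1 - a*z^-1, i.e. y[n] = x[n] - a*x[n-1]."""
    y = np.empty_like(x, dtype=float)
    y[0] = x[0]
    y[1:] = x[1:] - a * x[:-1]
    return y

def frame_signal(x, fs=8000, frame_ms=32, shift_ms=16):
    """Split into overlapping Hamming-windowed frames (32 ms / 16 ms shift)."""
    flen = int(fs * frame_ms / 1000)      # 256 samples at 8 kHz
    fshift = int(fs * shift_ms / 1000)    # 128 samples
    n_frames = 1 + max(0, (len(x) - flen) // fshift)
    win = np.hamming(flen)
    return np.stack([x[i * fshift: i * fshift + flen] * win
                     for i in range(n_frames)])
```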
(3) Extract Mel-frequency cepstral coefficients (MFCC) from the speech frames as the speech features. The short-time frequency-domain characteristics of speech accurately describe its variation; MFCC is a feature vector computed according to the frequency resolution of the human auditory system, built on Fourier spectrum analysis. MFCC is computed as follows: first apply a fast Fourier transform (FFT, Fast Fourier Transformation) to each speech frame to obtain its short-time spectrum; next partition the short-time spectrum into several bandpass groups spaced on the Mel scale, each with a triangular frequency response; then compute the signal energy in each filter of the bank; finally compute the cepstral coefficients with a discrete cosine transform;
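Assuming a 256-point FFT (one 32 ms frame at 8 kHz), an illustrative 24-filter bank, 12 cepstra, and the standard mel formula — none of which the patent states explicitly — the FFT → mel filterbank → log → DCT pipeline can be sketched as:

```python
import numpy as np

def mel(f):
    # Standard mel mapping (assumed; the patent only cites the ear's
    # frequency resolution, not a formula)
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_filterbank(n_filt=24, n_fft=256, fs=8000):
    """Triangular bandpass filters spaced evenly on the mel scale."""
    mel_pts = np.linspace(mel(0.0), mel(fs / 2.0), n_filt + 2)
    hz_pts = 700.0 * (10.0 ** (mel_pts / 2595.0) - 1.0)
    bins = np.floor((n_fft + 1) * hz_pts / fs).astype(int)
    fb = np.zeros((n_filt, n_fft // 2 + 1))
    for m in range(1, n_filt + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)   # rising edge
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)   # falling edge
    return fb

def mfcc(frames, n_ceps=12, n_filt=24):
    """Per frame: FFT -> mel filterbank energies -> log -> DCT."""
    spec = np.abs(np.fft.rfft(frames, n=256, axis=1)) ** 2
    fb = mel_filterbank(n_filt=n_filt)
    log_e = np.log(np.maximum(spec @ fb.T, 1e-10))
    n = np.arange(n_filt)
    dct = np.cos(np.pi * np.outer(np.arange(1, n_ceps + 1), n + 0.5) / n_filt)
    return log_e @ dct.T
```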
The MFCC features mainly reflect the static characteristics of the speech; the dynamic characteristics of the speech signal can be described by the first- and second-order differences of the static feature trajectories. The complete speech feature vector consists of the MFCC parameters, their first- and second-order difference coefficients, and the normalized energy coefficient with its first- and second-order differences, giving 39 dimensions per frame;
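The 39-dimensional vector (13 static dimensions plus first- and second-order differences) can be assembled as below; the regression-style delta formula is a common choice, not one the patent specifies:

```python
import numpy as np

def deltas(feat, w=2):
    # Regression first-order differences over a +/-w frame window
    # (standard formula; the patent only asks for first/second differences)
    T = feat.shape[0]
    padded = np.pad(feat, ((w, w), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, w + 1))
    return sum(k * (padded[w + k:w + k + T] - padded[w - k:w - k + T])
               for k in range(1, w + 1)) / denom

def full_feature(mfcc12, log_energy):
    """12 MFCC + normalized energy = 13 static dims; deltas and
    double-deltas bring the total to 39 per frame."""
    static = np.hstack([mfcc12, log_energy[:, None]])
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])
```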
B. Standard pronunciation model training:
(1) The process of training the phoneme-based standard pronunciation model from the speech features of step A, shown in Fig. 2, is:
a. Build a prototype multivariate Gaussian distribution with a diagonal covariance matrix whose dimensionality matches the single-stream speech features, and estimate its mean vector and covariance matrix from all the speech data.
b. Fix the pronunciation dictionary and phonetic symbol set, and produce phoneme-level transcriptions for all the speech. The phonetic symbol set of this embodiment comprises 40 phonemes plus 1 silence symbol and 1 pause symbol.
c. This embodiment adopts phoneme-based hidden Markov models (HMM) as the standard pronunciation model; the HMM is the statistical speech recognition model most widely adopted today, and its left-to-right state transition structure describes the characteristics of speech production well. The phoneme and silence models of the invention are 3-state HMMs, and the pause model is a single-state HMM that can be skipped; their topologies are shown in Fig. 3, where q_i denotes an HMM state, a_{ij} an HMM transition probability, and b_j(O_t) the multi-stream Gaussian-mixture output density of HMM state j, given by formula (1):
b_j(O_t) = \prod_{s=1}^{S} \left[ \sum_{m=1}^{M_s} c_{jsm} \, N(O_{st}; \mu_{jsm}, \Phi_{jsm}) \right]^{\gamma_s}    (1)
where S is the number of data streams, M_s is the number of Gaussian mixture components in stream s, and N is the multivariate Gaussian distribution of formula (2):
N(o; \mu, \Phi) = \frac{1}{\sqrt{(2\pi)^n |\Phi|}} \, e^{-\frac{1}{2}(o-\mu)^T \Phi^{-1} (o-\mu)}    (2)
The standard pronunciation model of this embodiment comprises 40 phoneme HMMs, one silence HMM, and one pause HMM. The Gaussian prototype is copied into each HMM, and each HMM is then re-estimated several times with the Baum-Welch algorithm; the number of re-estimations may be 5;
d. Progressively increase the number of Gaussian components in each HMM and retrain the resulting models with Baum-Welch. The number of Gaussian components is increased through 2, 4, 6, and 8; when it reaches 8, training is repeated 10 times and the training process ends.
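The patent does not say how the Gaussian components are added; a common scheme (HTK-style "mix-up", assumed here) splits the heaviest component by perturbing its mean by ±0.2 standard deviations:

```python
import numpy as np

def mixup(weights, means, variances, target):
    """Grow a diagonal-covariance GMM to `target` components by repeatedly
    splitting the heaviest one (an assumed HTK-style scheme; the patent
    does not specify its incrementing method)."""
    weights, means, variances = list(weights), list(means), list(variances)
    while len(weights) < target:
        i = int(np.argmax(weights))
        offset = 0.2 * np.sqrt(variances[i])
        w = weights[i] / 2.0
        weights[i] = w
        means.append(means[i] - offset)     # one copy shifted down
        means[i] = means[i] + offset        # one copy shifted up
        weights.append(w)
        variances.append(variances[i].copy())
    return np.array(weights), np.array(means), np.array(variances)
```

After each split, the grown mixture would be re-estimated with Baum-Welch before splitting again, matching the 2, 4, 6, 8 schedule above.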
(2) For the accent adaptation of the standard pronunciation model to Chinese speakers, the embodiment adopts global MLLR followed by MAP adaptation in series, with the number of adaptation iterations set to 4; the flow is shown in Fig. 4.
a. MLLR is an adaptation algorithm based on model transformation. Its basic assumption is that phones that are close in the speaker-independent model space remain close in the adapted speaker's space, so the transformation between the two spaces can be estimated from the phones present in the adaptation speech and then applied to map the models of unseen phones from the speaker-independent space into the adapted speaker's space, thereby achieving adaptation. The model space is divided into R classes according to some metric (such as Euclidean distance or likelihood); the transform of class r is T_r(\cdot), its adaptation speech set is X_r, r = 1, 2, \ldots, R, and its model parameters are \lambda_r, r = 1, 2, \ldots, R. Adaptive training then satisfies:
T_r = \arg\max_T P(X_r \mid T(\lambda_r)), \quad r = 1, 2, \ldots, R    (3)
The adapted parameters \hat\lambda_r, r = 1, 2, \ldots, R satisfy

\hat\lambda_r = T_r(\lambda_r), \quad r = 1, 2, \ldots, R    (4)
Because this class of algorithm fully exploits the relationships among phones, several models share one transform and the parameters to be estimated are just the coefficients of each transform; enough data to estimate them accumulates easily, so the adaptation takes effect even with little adaptation data and is correspondingly fast. The embodiment adopts unclassified global MLLR adaptation.
b. The basic criterion of the MAP algorithm is maximization of the posterior probability, so it is theoretically optimal:
\hat\theta_i = \arg\max_{\theta_i} P(\theta_i \mid x)    (5)
The mean-vector estimation formula of the standard MAP algorithm is:
\hat\mu = \frac{\sum_{t=1}^{T} L_t}{\sum_{t=1}^{T} L_t + \tau} \, \bar\mu + \frac{\tau}{\sum_{t=1}^{T} L_t + \tau} \, \mu    (6)
where L_t is the occupation probability of the observation vector at time t for this Gaussian mixture component, \tau is the prior weight of the adaptation speech data, \bar\mu is the mean vector of the adaptation speech, and \mu is the mean vector of the speaker-independent model. It can be seen that when the adaptation data is abundant, the adapted mean vector \hat\mu tends toward the speaker-dependent mean. The embodiment applies MAP adaptation after MLLR adaptation in order to exploit the adaptation speech data fully and further improve the accent adaptation effect.
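Formula (6) is a direct interpolation and can be checked numerically; below, mu_si, mu_adapt, occ_sum, and tau name \mu, \bar\mu, \sum_t L_t, and \tau respectively:

```python
import numpy as np

def map_update_mean(mu_si, mu_adapt, occ_sum, tau=10.0):
    """Formula (6): interpolate the speaker-independent mean toward the
    adaptation-data mean as the occupation count occ_sum grows; tau
    weights the prior (tau=10 is an illustrative value)."""
    a = occ_sum / (occ_sum + tau)
    return a * mu_adapt + (tau / (occ_sum + tau)) * mu_si
```

With no adaptation data the prior mean is returned unchanged; with abundant data the estimate converges to the adaptation mean, as the text observes.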
The final standard pronunciation model is stored in the external memory of the embedded system.
C. Standard pronunciation network generation:
The standard pronunciation network of the embodiment is shown in Fig. 5, where (a) is a linear network of word nodes whose start node is the initial "sil" and whose end node is the final "sil", and (b) shows each word as a linear network of phoneme nodes; each phoneme is internally a network of state nodes as in Fig. 3. The network generation process is shown in Fig. 6: first the original text is segmented into words, giving Fig. 5(a); then each word is looked up in the pronunciation dictionary, giving Fig. 5(b). To handle words with multiple pronunciations, and to save storage space and improve search efficiency, the embodiment aligns the phoneme strings of a word's alternative pronunciations by dynamic programming and merges the multiple phoneme sequences into a single phoneme-node network in which identical phonemes are shared across pronunciations. Finally the network is expanded into a state-node network using the phoneme HMMs; each state node records its state identifier, phoneme identifier, word identifier, and the number and identifiers of its predecessor nodes. This yields the standard pronunciation network of the embodiment: a network of HMM state nodes with definite start node P and end node T, without grammar, in which each node depends only on its predecessor nodes.
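Ignoring multiple pronunciations and state-level expansion, the word-segmentation and dictionary-lookup stage can be sketched as follows; the lexicon entries are invented examples, not the patent's 40-phoneme dictionary:

```python
def build_linear_network(text, lexicon):
    """Step C sketch: segment text into words, look up each word's
    phonemes, and chain them with "sil" at both ends into a linear
    node list plus predecessor arcs (no multi-pronunciation merging)."""
    nodes = ["sil"]
    for word in text.lower().split():
        nodes.extend(lexicon[word])
    nodes.append("sil")
    arcs = [(i - 1, i) for i in range(1, len(nodes))]  # each node's predecessor
    return nodes, arcs

# Invented example entries for illustration only:
lexicon = {"good": ["g", "uh", "d"],
           "morning": ["m", "ao", "r", "n", "ih", "ng"]}
```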
The standard pronunciation network is likewise stored in the external memory of the embedded system.
D. Speech endpoint detection:
(1) The speech signal is first low-pass filtered, then sampled and quantized by a 16-bit linear A/D converter into digital speech; the sampling frequency is 8 kHz;
(2) Apply pre-emphasis, framing, and windowing to the digital speech, obtaining quasi-stationary speech frames; the method is the same as step A(2);
(3) Compute the short-time log energy of the speech frames;
(4) Apply a moving-average filter to the time-domain log energy to obtain the endpoint feature. Endpoint detection runs in real time, and a real-time endpoint detection method must: a. give consistent output across different background noise levels; b. detect both start and end points; c. have short delay; d. have a bounded response interval; e. maximize the signal-to-noise ratio at the endpoints; f. locate endpoints accurately; g. suppress detection errors as far as possible. An objective function defined from these requirements turns out to be very similar to the edge detection functions (moving-average filters) commonly used in image processing. The moving-average filter is given by formula (7), where g(\cdot) is the time-domain log energy, t is the current frame index, and h(\cdot) is the moving-average filter of formula (8); h(\cdot) is an odd-symmetric function, W may be taken as 13, and f(\cdot) is given by formula (9) with parameters A = 0.2208, s = 0.5383, [K_1, \ldots, K_6] = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56].
F(t) = \sum_{i=-W}^{W} h(i) \, g(t+i)    (7)
h(i) = \begin{cases} -f(-i) & -W \le i < 0 \\ f(i) & 0 \le i \le W \end{cases}    (8)
f(x) = e^{Ax}[K_1 \sin(Ax) + K_2 \cos(Ax)] + e^{-Ax}[K_3 \sin(Ax) + K_4 \cos(Ax)] + K_5 + K_6 e^{sx}    (9)
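Formulas (7)-(9) with the constants quoted above can be transcribed directly; note that h(0) = K_2 + K_4 + K_5 + K_6 = 0, so together with the odd symmetry the filter gives (near-)zero response to constant log energy:

```python
import math

A, S_EXP = 0.2208, 0.5383                     # A and s from the text
K = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]
W = 13

def f(x):
    """Formula (9)."""
    return (math.exp(A * x) * (K[0] * math.sin(A * x) + K[1] * math.cos(A * x))
            + math.exp(-A * x) * (K[2] * math.sin(A * x) + K[3] * math.cos(A * x))
            + K[4] + K[5] * math.exp(S_EXP * x))

def h(i):
    """Formula (8): odd-symmetric filter taps."""
    return -f(-i) if i < 0 else f(i)

def edge_feature(log_energy, t):
    """Formula (7): filter the log-energy contour around frame t."""
    return sum(h(i) * log_energy[t + i] for i in range(-W, W + 1))
```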
(5) Apply upper/lower dual thresholds combined with a finite state machine to the endpoint feature to find the start and end points of the speech. The endpoint feature F(t) is positive at the start of speech, negative at the end of speech, and close to zero in silence. According to preset upper and lower thresholds and a minimum speech duration, each frame triggers transitions among the speech, silence, and leaving-speech states. The machine starts in the silence state; when F(t) reaches the upper threshold it outputs the start point of the speech and enters the speech state. In the speech state, when F(t) reaches the lower threshold the machine enters the leaving-speech state. When the time spent in the leaving-speech state reaches a preset threshold, the machine outputs the end point of the speech, closes the recording channel, and endpoint detection ends.
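A minimal version of the dual-threshold state machine might look like this; the threshold and hangover values are placeholders, since the patent leaves them as presets:

```python
SILENCE, SPEECH, LEAVING = 0, 1, 2

def detect_endpoints(F, upper=5.0, lower=-2.0, hangover=5):
    """Dual-threshold FSM over the endpoint feature F(t): rising past
    `upper` marks the start, falling past `lower` enters leaving-speech,
    and `hangover` frames of leaving-speech close the utterance.
    Thresholds here are illustrative, not the patent's presets."""
    state, start, end, leave_count = SILENCE, None, None, 0
    for t, v in enumerate(F):
        if state == SILENCE and v >= upper:
            state, start = SPEECH, t            # start point found
        elif state == SPEECH and v <= lower:
            state, leave_count = LEAVING, 1
        elif state == LEAVING:
            if v >= upper:                      # speech resumed
                state = SPEECH
            else:
                leave_count += 1
                if leave_count >= hangover:     # end point found
                    end = t
                    break
    if start is not None and end is None:
        end = len(F) - 1
    return start, end
```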
E. Speech feature extraction for evaluation:
Extract speech features from the speech frames of step D, using the same process as step A(3).
F. Optimal path search:
(1) Force-align the speech features of step E with the standard pronunciation network of step C to obtain all possible path information in the network. Because the standard pronunciation network of the embodiment is a left-to-right linear network (Fig. 5), the frame-synchronous Viterbi beam search algorithm can be used to find the optimal path. Given the HMM model \Phi and the observation vector sequence O = \{o_1, \ldots, o_T\}, we seek the optimal state sequence S = \{s_1, \ldots, s_T\} producing this observation sequence, i.e.
\hat S = \arg\max_S P(S, O \mid \Phi)    (10)
In the Viterbi algorithm, the likelihood of the optimal path ending in state i at time t is defined as

V_i(t) = P(o_1, \ldots, o_t, s_1, \ldots, s_{t-1}, s_t = i \mid \Phi)    (11)
In a linear network the optimal path at any time depends only on the current frame and the information of the previous frame; that is, it satisfies the no-aftereffect (Markov) property. Therefore, if the globally optimal path passes through node i at time t, then its portion between times 0 and t must be optimal among all paths ending at node i at time t. If we only want the optimal path, it suffices at each time t to keep just one path ending at each node i.
Based on this principle, the search algorithm of the embodiment is:
Definitions: PreNode(i) is the set of predecessor nodes of node i. BestPre(t, i) is the optimal predecessor of node i at time t. L(t, i) is the likelihood score of node i for the speech frame at time t. L_Path(-1, i) and L_Path(0, i) are the optimal-path likelihood scores of paths ending at node i for the previous frame and the current frame, respectively.
Step 1: at time t = 0,

L\_Path(-1, i) = \begin{cases} L(0, i) & i \in Entry \\ 0 & i \notin Entry \end{cases}    (12)

where i \in Entry means that i is a start node.
Step 2: at time t, obtain the current-frame likelihood score L(t, i) for every node i; the optimal path score for the current frame is then

L\_Path(0, i) = \max_j \big( L\_Path(-1, j) \big) + L(t, i), \quad \forall j \in PreNode(i)    (13)

Record the optimal predecessor in BestPre(t, i), then swap the data of L_Path(-1, i) and L_Path(0, i) in preparation for the next frame.
Step 3: if t < T, go to Step 2; otherwise, finish.
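Steps 1-3 can be sketched for a strictly linear network in which each node's predecessors are itself (a self-loop) and the previous node; this topology and the per-frame score array are illustrative assumptions:

```python
NEG_INF = float("-inf")

def viterbi_linear(log_like, n_nodes):
    """Frame-synchronous Viterbi on a linear network: PreNode(i) = {i-1, i},
    node 0 is the entry, node n_nodes-1 the exit. log_like[t][i] plays the
    role of L(t, i); best_pre records BestPre(t, i) for traceback."""
    T = len(log_like)
    path = [[NEG_INF] * n_nodes for _ in range(T)]
    best_pre = [[0] * n_nodes for _ in range(T)]
    path[0][0] = log_like[0][0]           # Step 1: only the entry node starts
    for t in range(1, T):                 # Step 2: extend every node
        for i in range(n_nodes):
            cands = [(path[t - 1][i], i)]
            if i > 0:
                cands.append((path[t - 1][i - 1], i - 1))
            best, pre = max(cands)
            if best > NEG_INF:
                path[t][i] = best + log_like[t][i]
                best_pre[t][i] = pre
    states = [n_nodes - 1]                # trace back from the exit node
    for t in range(T - 1, 0, -1):
        states.append(best_pre[t][states[-1]])
    states.reverse()
    return states, path[T - 1][n_nodes - 1]
```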
(2) When the speech ends, trace back through BestPre(t, i) from the permitted terminal node of the network to obtain the optimal state path of the forced alignment;
G. Pronunciation quality score calculation:
(1) Use the optimal path information from step F to compute the confidence score of each frame of speech features, as in formula (14), where s_i is the state to which frame j is aligned on the optimal path:

C_j = \log p(O_j \mid s_i) - \log \sum_i p(O_j \mid s_i)    (14)
(2) Use the optimal path information from step F to compute the confidence score of each state on the path, and average the confidence scores of all states on the optimal path to obtain the sentence-level confidence score, as in formula (15), where N is the number of states on the optimal path and js and je are the first and last frame indices aligned to each state:

C = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{je - js} \sum_{j=js}^{je} C_j \right)    (15)
(3) Map the sentence-level confidence score into the subjective scoring range with a mapping function. The directly computed confidence score usually lies in an interval (-\infty, a], where a is a constant, which does not match the subjective scoring range; the embodiment therefore maps it to the subjective score range with the piecewise linear function of formula (16), where a and b are determined experimentally and \alpha is a scaling factor:

S = \begin{cases} \alpha C & a \le C \le b \\ 100 & C > b \\ 0 & C < a \end{cases}    (16)
The resulting score S can further be quantized into pronunciation quality grades of excellent, good, fair, and poor.
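Formula (16) and the grade quantization can be sketched as follows; a, b, \alpha, and the grade boundaries are illustrative placeholders, since the patent determines a and b experimentally:

```python
def map_score(C, a=0.0, b=10.0, alpha=10.0):
    """Formula (16): piecewise-linear map of the sentence confidence C
    into 0-100. The values a=0, b=10, alpha=10 are placeholders chosen
    so the map is continuous; the patent fits a and b by experiment."""
    if C > b:
        return 100.0
    if C < a:
        return 0.0
    return alpha * C

def grade(score):
    """Quantize the 0-100 score into excellent/good/fair/poor bands
    (band edges are assumptions, not given in the patent)."""
    if score >= 85:
        return "excellent"
    if score >= 70:
        return "good"
    if score >= 60:
        return "fair"
    return "poor"
```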
Considering the memory constraints, steps D, E, F, and G of the embodiment are executed in segments of a predefined fixed number of frames; each segment may be 40 frames.
Based on the above method, the embodiment has implemented an embedded English learning system built around pronunciation quality evaluation. Learning content can be updated automatically at any time according to teaching requirements. The pronunciation quality evaluation technology enables interactive human-machine learning, greatly reduces the workload of classroom spoken-English teaching, eases the shortage of teachers, and enables autonomous learning and automatic testing of spoken English. The invention can evaluate the English pronunciation quality of standard-Mandarin-speaking Chinese learners; with a 4-grade scale (excellent, good, fair, poor), the method's correlation with subjective evaluation of Chinese speakers' English pronunciation quality reaches 0.74.

Claims (5)

1, a kind of pronunciation quality evaluating method that is used for language learner, comprise that the phonetic feature that is used to train extracts, the Received Pronunciation model training, the generation of Received Pronunciation network, sound end detects, the phonetic feature that is used to estimate extracts, optimum route search, and the calculating each several part of voice quality mark; It is characterized in that the implementation method of each several part specifically may further comprise the steps:
A. Speech feature extraction for training:
(1) build in advance a training database containing a large amount of read-aloud speech;
(2) apply pre-emphasis, framing and windowing to the digital speech in each file of the training database, yielding framed speech that is approximately stationary;
(3) extract speech features from the framed speech, the features being cepstral coefficients;
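Steps A(2)-(3) — pre-emphasis, framing and windowing before cepstral analysis — can be sketched with NumPy. The frame length, frame shift and pre-emphasis coefficient below are common defaults, not values stated in the patent:

```python
import numpy as np

def frame_speech(signal, frame_len=400, frame_shift=160, pre_emph=0.97):
    """Pre-emphasize, split into overlapping frames and apply a Hamming
    window, yielding quasi-stationary framed speech ready for cepstral
    feature extraction."""
    # Pre-emphasis boosts high frequencies: s'[n] = s[n] - k * s[n-1]
    emphasized = np.append(signal[0], signal[1:] - pre_emph * signal[:-1])
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // frame_shift)
    window = np.hamming(frame_len)
    frames = np.stack([
        emphasized[i * frame_shift : i * frame_shift + frame_len] * window
        for i in range(n_frames)
    ])
    return frames
```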
B. Standard pronunciation model training:
(1) use the speech features of step A to train a phoneme-based standard pronunciation model;
(2) adapt the standard pronunciation model to the accent of Chinese speakers to obtain the final standard pronunciation model, optimizing the model's evaluation performance for Chinese speakers;
C. Standard pronunciation network generation:
Segment the given text into words, look up a pronunciation dictionary to obtain the phoneme transcription, and use the phoneme-based standard pronunciation model to obtain a linear standard pronunciation network whose nodes are HMM states;
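The linear network of step C — words expanded to phonemes via dictionary lookup, phonemes expanded to a chain of HMM states — can be sketched as follows. The toy dictionary and the 3-states-per-phoneme topology are illustrative assumptions:

```python
# Hypothetical pronunciation dictionary: word -> phoneme sequence
LEXICON = {"hello": ["HH", "AH", "L", "OW"], "world": ["W", "ER", "L", "D"]}

def build_linear_network(text, states_per_phone=3):
    """Expand a sentence into a linear chain of HMM-state nodes with
    definite start and terminal nodes, as in the standard pronunciation
    network of step C."""
    nodes = ["<start>"]
    for word in text.lower().split():
        for phone in LEXICON[word]:
            nodes.extend(f"{phone}_s{k}" for k in range(states_per_phone))
    nodes.append("<end>")
    return nodes

net = build_linear_network("hello world")
```

Because the network is linear, decoding against it amounts to forced alignment rather than free recognition.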
D. Speech endpoint detection:
(1) convert the analog speech signal into digital speech by A/D conversion;
(2) apply pre-emphasis, framing and windowing to the digital speech, yielding framed speech that is approximately stationary;
(3) compute the time-domain log energy of the framed speech;
(4) smooth the time-domain log energy with a moving-average filter to obtain the endpoint-detection feature;
(5) apply a combination of upper/lower dual thresholds and a finite state machine to the endpoint-detection feature to obtain the start and end points of the speech;
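Steps D(3)-(5) — log energy, moving-average smoothing, and a dual-threshold state machine — could look like the sketch below. The threshold values and the smoothing window length are illustrative assumptions:

```python
import numpy as np

def detect_endpoints(frames, hi=2.0, lo=1.0):
    """Return (start, end) frame indices of speech using smoothed
    time-domain log energy and an upper/lower dual-threshold finite
    state machine: enter speech above hi, leave speech below lo."""
    log_e = np.log(np.sum(frames ** 2, axis=1) + 1e-10)       # per-frame log energy
    smooth = np.convolve(log_e, np.ones(5) / 5, mode="same")  # moving average
    state, start, end = "silence", None, None
    for i, e in enumerate(smooth):
        if state == "silence" and e > hi:      # crossed the upper threshold
            state, start = "speech", i
        elif state == "speech" and e < lo:     # fell below the lower threshold
            state, end = "silence", i
    if state == "speech":                      # speech ran to the end
        end = len(smooth)
    return start, end
```

Using two thresholds with hysteresis keeps short energy dips inside a word from prematurely ending the speech segment.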
E. Speech feature extraction for evaluation:
Extract speech features from the framed speech of step D, using the same process as step A(3).
F. Optimal path search:
(1) force-match the speech features of step E against the standard pronunciation network of step C to obtain all possible path information in the network;
(2) using the path information, backtrack from the permitted terminal nodes of the network to obtain the optimal path;
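The forced matching of step F is essentially a Viterbi alignment of the feature frames against the linear state network, followed by backtracking from the terminal node. A minimal dynamic-programming sketch, with a hypothetical per-frame/state log-score function supplied by the caller:

```python
import numpy as np

def force_align(score, n_frames, n_states):
    """Viterbi-align frames to a left-to-right linear network.
    score(t, s) returns the log score of frame t in state s.
    Returns the optimal state index for each frame, recovered by
    backtracking from the terminal state."""
    NEG = -1e30
    delta = np.full((n_frames, n_states), NEG)  # best cumulative score
    back = np.zeros((n_frames, n_states), dtype=int)
    delta[0, 0] = score(0, 0)
    for t in range(1, n_frames):
        for s in range(n_states):
            # linear topology: stay in state s, or advance from s-1
            stay = delta[t - 1, s]
            adv = delta[t - 1, s - 1] if s > 0 else NEG
            if adv > stay:
                delta[t, s], back[t, s] = adv + score(t, s), s - 1
            else:
                delta[t, s], back[t, s] = stay + score(t, s), s
    path = [n_states - 1]                        # end at the terminal state
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t, path[-1]])
    return path[::-1]
```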
G. Pronunciation quality score calculation:
(1) use the optimal path information of step F to compute the confidence score of each frame of speech features;
(2) use the optimal path information of step F to compute the confidence score of each state on the path, and average the confidence scores of all states on the optimal path to obtain the whole-sentence confidence score;
(3) map the whole-sentence confidence score onto the subjective scoring interval with a mapping function to obtain the final pronunciation quality score.
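Steps G(1)-(2) group the frame confidences by the state occupying them and average twice — within each state, then over the states on the path. A sketch (how each frame confidence itself is computed, e.g. as a log-posterior, is left to the caller):

```python
def sentence_confidence(frame_conf, path):
    """Average frame confidences within each state on the optimal path,
    then average the per-state scores to get the whole-sentence
    confidence, so long states do not dominate the sentence score."""
    by_state = {}
    for conf, state in zip(frame_conf, path):
        by_state.setdefault(state, []).append(conf)
    state_scores = [sum(v) / len(v) for v in by_state.values()]
    return sum(state_scores) / len(state_scores)
```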
2. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the cepstral coefficients of step A are Mel-frequency cepstral coefficients, which exploit the frequency-resolution characteristics of the human ear.
3. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the standard pronunciation model of step B(1) is a phoneme-based hidden Markov model, trained as follows: initialize a single Gaussian model from all speech features, copy this model to every phoneme model, and train the models repeatedly by the Baum-Welch method; then repeatedly increase the number of Gaussian components of each phoneme model and retrain by Baum-Welch.
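The training schedule of claim 3 — one global Gaussian copied into every phoneme model, Baum-Welch re-estimation alternating with mixture growth — can be sketched structurally. The Baum-Welch update itself is stubbed out here, and the component-splitting perturbation is a common heuristic rather than something the patent specifies:

```python
import numpy as np

def init_phone_models(features, phones, max_mix=4, n_iter=3):
    """Claim-3 schedule: a single global Gaussian initializes every
    phoneme model; Baum-Welch re-estimation (stub below) alternates
    with doubling the number of Gaussian components."""
    mean, var = features.mean(0), features.var(0)
    # copy the one global Gaussian into every phoneme model
    models = {p: {"means": [mean.copy()], "vars": [var.copy()]} for p in phones}
    n_mix = 1
    while n_mix < max_mix:
        for _ in range(n_iter):
            baum_welch_reestimate(models, features)
        for m in models.values():  # split each component into two
            m["means"] = [mu + eps for mu in m["means"]
                          for eps in (0.2 * np.sqrt(var), -0.2 * np.sqrt(var))]
            m["vars"] = [v.copy() for v in m["vars"] for _ in range(2)]
        n_mix *= 2
    return models

def baum_welch_reestimate(models, features):
    """Placeholder for the Baum-Welch (forward-backward) update,
    omitted here for brevity."""
    pass
```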
4. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the accent adaptation of step B(2) is implemented by applying maximum likelihood linear regression (MLLR) and maximum a posteriori (MAP) accent adaptation to the trained standard pronunciation model, yielding the final standard pronunciation model.
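Claim 4 combines MLLR and MAP accent adaptation. The standard MAP mean update, for instance, interpolates the native-model prior mean with the adaptation-data statistics; a sketch, where the prior weight τ is a conventional assumption rather than a value from the patent:

```python
import numpy as np

def map_adapt_mean(prior_mean, adapt_frames, tau=10.0):
    """MAP update of a Gaussian mean: mu = (tau*mu0 + sum(x)) / (tau + n).
    With little adaptation data the estimate stays near the
    standard-pronunciation prior; with much data it follows the data."""
    n = len(adapt_frames)
    return (tau * prior_mean + adapt_frames.sum(axis=0)) / (tau + n)
```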
5. The pronunciation quality evaluation method for a language learning machine of claim 1, characterized in that the standard pronunciation network of step C is a linear network with definite start and terminal nodes, whose nodes are HMM states; it does not consider grammar, and the current node depends only on its predecessor node.
CNB2005101148488A 2005-11-18 2005-11-18 Pronunciation quality evaluating method for language learning machine Expired - Fee Related CN100411011C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2005101148488A CN100411011C (en) 2005-11-18 2005-11-18 Pronunciation quality evaluating method for language learning machine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2005101148488A CN100411011C (en) 2005-11-18 2005-11-18 Pronunciation quality evaluating method for language learning machine

Publications (2)

Publication Number Publication Date
CN1763843A true CN1763843A (en) 2006-04-26
CN100411011C CN100411011C (en) 2008-08-13

Family

ID=36747941

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2005101148488A Expired - Fee Related CN100411011C (en) 2005-11-18 2005-11-18 Pronunciation quality evaluating method for language learning machine

Country Status (1)

Country Link
CN (1) CN100411011C (en)

Cited By (31)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2009097738A1 (en) * 2008-01-30 2009-08-13 Institute Of Computing Technology, Chinese Academy Of Sciences Method and system for audio matching
CN101996635A (en) * 2010-08-30 2011-03-30 清华大学 English pronunciation quality evaluation method based on accent highlight degree
CN101246685B (en) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN101105894B (en) * 2006-07-12 2011-08-10 陈修志 Multifunctional language learning machine
CN102237086A (en) * 2010-04-28 2011-11-09 三星电子株式会社 Compensation device and method for voice recognition equipment
CN102253976A (en) * 2011-06-17 2011-11-23 苏州思必驰信息科技有限公司 Metadata processing method and system for spoken language learning
CN101826263B (en) * 2009-03-04 2012-01-04 中国科学院自动化研究所 Objective standard based automatic oral evaluation system
CN101739868B (en) * 2008-11-19 2012-03-28 中国科学院自动化研究所 Automatic evaluation and diagnosis method of text reading level for oral test
WO2012055113A1 (en) * 2010-10-29 2012-05-03 安徽科大讯飞信息科技股份有限公司 Method and system for endpoint automatic detection of audio record
CN102568475A (en) * 2011-12-31 2012-07-11 安徽科大讯飞信息科技股份有限公司 System and method for assessing proficiency in Putonghua
US8385221B2 (en) 2010-02-28 2013-02-26 International Business Machines Corporation System and method for monitoring of user quality-of-experience on a wireless network
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
CN103177733A (en) * 2013-03-11 2013-06-26 哈尔滨师范大学 Method and system for evaluating Chinese mandarin retroflex suffixation pronunciation quality
CN103366759A (en) * 2012-03-29 2013-10-23 北京中传天籁数字技术有限公司 Speech data evaluation method and speech data evaluation device
CN104766607A (en) * 2015-03-05 2015-07-08 广州视源电子科技股份有限公司 Television program recommendation method and system
CN105261246A (en) * 2015-12-02 2016-01-20 武汉慧人信息科技有限公司 Spoken English error correcting system based on big data mining technology
CN105529030A (en) * 2015-12-29 2016-04-27 百度在线网络技术(北京)有限公司 Speech recognition processing method and device
CN106328123A (en) * 2016-08-25 2017-01-11 苏州大学 Method of recognizing ear speech in normal speech flow under condition of small database
CN106558308A (en) * 2016-12-02 2017-04-05 深圳撒哈拉数据科技有限公司 A kind of internet audio quality of data auto-scoring system and method
CN106803424A (en) * 2015-11-26 2017-06-06 北京奥鹏远程教育中心有限公司 A kind of Chinese proficiency measuring technology
CN106847308A (en) * 2017-02-08 2017-06-13 西安医学院 A kind of pronunciation of English QA system
CN107767858A (en) * 2017-09-08 2018-03-06 科大讯飞股份有限公司 Pronunciation dictionary generation method and device, storage medium, electronic equipment
CN107958673A (en) * 2017-11-28 2018-04-24 北京先声教育科技有限公司 A kind of spoken language methods of marking and device
CN108520749A (en) * 2018-03-06 2018-09-11 杭州孚立计算机软件有限公司 A kind of voice-based grid-based management control method and control device
CN109313892A (en) * 2017-05-17 2019-02-05 北京嘀嘀无限科技发展有限公司 Steady language identification method and system
CN109859773A (en) * 2019-02-14 2019-06-07 北京儒博科技有限公司 A kind of method for recording of sound, device, storage medium and electronic equipment
CN110390948A (en) * 2019-07-24 2019-10-29 厦门快商通科技股份有限公司 A kind of method and system of Rapid Speech identification
CN110415725A (en) * 2019-07-15 2019-11-05 北京语言大学 Use the method and system of first language data assessment second language pronunciation quality
CN111128181A (en) * 2019-12-09 2020-05-08 科大讯飞股份有限公司 Recitation question evaluation method, device and equipment
CN111710332A (en) * 2020-06-30 2020-09-25 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN112530455A (en) * 2020-11-24 2021-03-19 东风汽车集团有限公司 Automobile door closing sound quality evaluation method and evaluation system based on MFCC

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1236928A (en) * 1998-05-25 1999-12-01 郭巧 Computer aided Chinese intelligent education system and its implementation method
CN1123863C (en) * 2000-11-10 2003-10-08 清华大学 Information check method based on speed recognition

Cited By (41)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101105894B (en) * 2006-07-12 2011-08-10 陈修志 Multifunctional language learning machine
WO2009097738A1 (en) * 2008-01-30 2009-08-13 Institute Of Computing Technology, Chinese Academy Of Sciences Method and system for audio matching
CN101246685B (en) * 2008-03-17 2011-03-30 清华大学 Pronunciation quality evaluation method of computer auxiliary language learning system
CN101739868B (en) * 2008-11-19 2012-03-28 中国科学院自动化研究所 Automatic evaluation and diagnosis method of text reading level for oral test
CN101826263B (en) * 2009-03-04 2012-01-04 中国科学院自动化研究所 Objective standard based automatic oral evaluation system
US8385221B2 (en) 2010-02-28 2013-02-26 International Business Machines Corporation System and method for monitoring of user quality-of-experience on a wireless network
CN102237086A (en) * 2010-04-28 2011-11-09 三星电子株式会社 Compensation device and method for voice recognition equipment
CN101996635A (en) * 2010-08-30 2011-03-30 清华大学 English pronunciation quality evaluation method based on accent highlight degree
KR101417975B1 (en) * 2010-10-29 2014-07-09 안후이 유에스티씨 아이플라이텍 캄파니 리미티드 Method and system for endpoint automatic detection of audio record
US9330667B2 (en) 2010-10-29 2016-05-03 Iflytek Co., Ltd. Method and system for endpoint automatic detection of audio record
WO2012055113A1 (en) * 2010-10-29 2012-05-03 安徽科大讯飞信息科技股份有限公司 Method and system for endpoint automatic detection of audio record
CN102253976B (en) * 2011-06-17 2013-05-15 苏州思必驰信息科技有限公司 Metadata processing method and system for spoken language learning
CN102253976A (en) * 2011-06-17 2011-11-23 苏州思必驰信息科技有限公司 Metadata processing method and system for spoken language learning
CN102568475A (en) * 2011-12-31 2012-07-11 安徽科大讯飞信息科技股份有限公司 System and method for assessing proficiency in Putonghua
CN102568475B (en) * 2011-12-31 2014-11-26 安徽科大讯飞信息科技股份有限公司 System and method for assessing proficiency in Putonghua
CN103366759A (en) * 2012-03-29 2013-10-23 北京中传天籁数字技术有限公司 Speech data evaluation method and speech data evaluation device
CN102982811A (en) * 2012-11-24 2013-03-20 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
CN102982811B (en) * 2012-11-24 2015-01-14 安徽科大讯飞信息科技股份有限公司 Voice endpoint detection method based on real-time decoding
CN103177733A (en) * 2013-03-11 2013-06-26 哈尔滨师范大学 Method and system for evaluating Chinese mandarin retroflex suffixation pronunciation quality
CN103177733B (en) * 2013-03-11 2015-09-09 哈尔滨师范大学 Standard Chinese suffixation of a nonsyllabic "r" sound voice quality evaluating method and system
CN104766607A (en) * 2015-03-05 2015-07-08 广州视源电子科技股份有限公司 Television program recommendation method and system
CN106803424A (en) * 2015-11-26 2017-06-06 北京奥鹏远程教育中心有限公司 A kind of Chinese proficiency measuring technology
CN105261246A (en) * 2015-12-02 2016-01-20 武汉慧人信息科技有限公司 Spoken English error correcting system based on big data mining technology
CN105261246B (en) * 2015-12-02 2018-06-05 武汉慧人信息科技有限公司 A kind of Oral English Practice error correction system based on big data digging technology
CN105529030A (en) * 2015-12-29 2016-04-27 百度在线网络技术(北京)有限公司 Speech recognition processing method and device
CN106328123A (en) * 2016-08-25 2017-01-11 苏州大学 Method of recognizing ear speech in normal speech flow under condition of small database
CN106328123B (en) * 2016-08-25 2020-03-20 苏州大学 Method for recognizing middle ear voice in normal voice stream under condition of small database
CN106558308A (en) * 2016-12-02 2017-04-05 深圳撒哈拉数据科技有限公司 A kind of internet audio quality of data auto-scoring system and method
CN106558308B (en) * 2016-12-02 2020-05-15 深圳撒哈拉数据科技有限公司 Internet audio data quality automatic scoring system and method
CN106847308A (en) * 2017-02-08 2017-06-13 西安医学院 A kind of pronunciation of English QA system
CN109313892A (en) * 2017-05-17 2019-02-05 北京嘀嘀无限科技发展有限公司 Steady language identification method and system
CN109313892B (en) * 2017-05-17 2023-02-21 北京嘀嘀无限科技发展有限公司 Robust speech recognition method and system
CN107767858A (en) * 2017-09-08 2018-03-06 科大讯飞股份有限公司 Pronunciation dictionary generation method and device, storage medium, electronic equipment
CN107958673A (en) * 2017-11-28 2018-04-24 北京先声教育科技有限公司 A kind of spoken language methods of marking and device
CN108520749A (en) * 2018-03-06 2018-09-11 杭州孚立计算机软件有限公司 A kind of voice-based grid-based management control method and control device
CN109859773A (en) * 2019-02-14 2019-06-07 北京儒博科技有限公司 A kind of method for recording of sound, device, storage medium and electronic equipment
CN110415725A (en) * 2019-07-15 2019-11-05 北京语言大学 Use the method and system of first language data assessment second language pronunciation quality
CN110390948A (en) * 2019-07-24 2019-10-29 厦门快商通科技股份有限公司 A kind of method and system of Rapid Speech identification
CN111128181A (en) * 2019-12-09 2020-05-08 科大讯飞股份有限公司 Recitation question evaluation method, device and equipment
CN111710332A (en) * 2020-06-30 2020-09-25 北京达佳互联信息技术有限公司 Voice processing method and device, electronic equipment and storage medium
CN112530455A (en) * 2020-11-24 2021-03-19 东风汽车集团有限公司 Automobile door closing sound quality evaluation method and evaluation system based on MFCC

Also Published As

Publication number Publication date
CN100411011C (en) 2008-08-13

Similar Documents

Publication Publication Date Title
CN1763843A (en) Pronunciation quality evaluating method for language learning machine
CN101661675B (en) Self-sensing error tone pronunciation learning method and system
US9672816B1 (en) Annotating maps with user-contributed pronunciations
Gao et al. A study on robust detection of pronunciation erroneous tendency based on deep neural network.
CN101645271B (en) Rapid confidence-calculation method in pronunciation quality evaluation system
CN103928023A (en) Voice scoring method and system
CN101887725A (en) Phoneme confusion network-based phoneme posterior probability calculation method
CN101246685A (en) Pronunciation quality evaluation method of computer auxiliary language learning system
CN109377981B (en) Phoneme alignment method and device
US11676572B2 (en) Instantaneous learning in text-to-speech during dialog
CN109243460A (en) A method of automatically generating news or interrogation record based on the local dialect
CN110047474A (en) A kind of English phonetic pronunciation intelligent training system and training method
CN1787070A (en) Chip upper system for language learner
Suyanto et al. End-to-End speech recognition models for a low-resourced Indonesian Language
Sinclair et al. A semi-markov model for speech segmentation with an utterance-break prior
Xu English speech recognition and evaluation of pronunciation quality using deep learning
Rasanen Basic cuts revisited: Temporal segmentation of speech into phone-like units with statistical learning at a pre-linguistic level
Mary et al. Searching speech databases: features, techniques and evaluation measures
Liu et al. AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning
CN112767961B (en) Accent correction method based on cloud computing
Liu et al. Deriving disyllabic word variants from a Chinese conversational speech corpus
Yang et al. Landmark-based pronunciation error identification on Chinese learning
US8768697B2 (en) Method for measuring speech characteristics
Zheng An analysis and research on Chinese college students’ psychological barriers in oral English output from a cross-cultural perspective
Jin Design of Students' Spoken English Pronunciation Training System Based on Computer VB Platform.

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20080813

Termination date: 20191118