CN1763843A - Pronunciation quality evaluating method for language learning machine - Google Patents
- Publication number
- CN1763843A (publication number); CN200510114848 (application number)
- Authority
- CN
- China
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
- Classification: Electrically Operated Instructional Devices
Abstract
The invention discloses a pronunciation quality evaluation method for a language learning machine, in the field of computer-assisted language learning and speech technology. The method comprises: extracting speech features for training; training a standard pronunciation model; generating a standard pronunciation network; detecting speech endpoints; extracting speech features for evaluation; searching for the optimal path; and calculating a pronunciation quality score. The method gives objective and stable evaluations, and supports an embedded English learning system with interactive human-machine teaching and self-testing of spoken English.
Description
Technical field
The invention belongs to the field of computer-assisted language learning and speech technology, and in particular relates to a pronunciation quality evaluation method implemented on digital signal processing chips of 16 bits and above.
Background technology
In recent years, embedded language learning products have developed rapidly at home and abroad. Early products were mainly language repeaters: add-on devices for analog tape recorders that digitized and stored a short segment of speech so it could be played back repeatedly, helping the learner listen again and again and read along from memory. The mainstream language learning machines on the market today are second-generation products built on digital signal processing (DSP) chips. Their hardware typically comprises a microcontroller (MCU, Micro Control Unit), a DSP chip, a codec (CODEC), ROM, SRAM, flash memory, a universal serial bus (USB) interface, a keyboard, and a liquid crystal display (LCD). The MCU serves as the main control chip, running device drivers and system control programs such as the program scheduler, while the DSP runs the application algorithms. The application programs include basic modules such as recording, playback, and speech-rate adjustment; some products also include an MP3 module. Functionally they offer repeat playback, follow-along reading, read-and-compare, synchronized text display, content retrieval, and rate-adjustable playback, and most can download and update learning material over the Internet. The Haoji Xing English learning machine from the Shenzhen Haoji Xing company is a typical representative of second-generation digital English learning machines.
The key to learning a language, and spoken language in particular, is interaction: a teacher evaluates and guides the learner promptly and specifically during the learning process. In traditional teacher-centered language learning, however, the shortage of qualified teachers means this task cannot be accomplished, and existing language learning machines lack the ability to evaluate the learner's pronunciation.
Summary of the invention
The object of the invention is to overcome the shortcomings of the prior art by proposing a pronunciation quality evaluation method for language learning machines that achieves high-performance, text-independent and speaker-independent evaluation on an embedded language learning machine. The method has moderate complexity, high evaluation accuracy, and good robustness; in particular, its evaluation accuracy for speakers with a Chinese accent reaches or even surpasses the current international state of the art.
The pronunciation quality evaluation method for a language learning machine proposed by the invention comprises: speech feature extraction for training, standard pronunciation model training, standard pronunciation network generation, speech endpoint detection, speech feature extraction for evaluation, optimal path search, and pronunciation quality score calculation. It is characterized in that each part is implemented by the following steps:
A. Speech feature extraction for training:
(1) build in advance a training database containing a large amount of read-aloud speech;
(2) apply pre-emphasis and frame-splitting with windowing to the digital speech in each file of the training database, obtaining framed speech that is approximately stationary;
(3) extract speech features from the framed speech; the features are cepstral coefficients.
B. Standard pronunciation model training:
(1) train a phoneme-based standard pronunciation model using the speech features of step A;
(2) adapt the standard pronunciation model to the accent of Chinese speakers to obtain the final standard pronunciation model, optimizing the model's evaluation performance for Chinese speakers.
C. Standard pronunciation network generation:
segment the given text into words, look up the pronunciation dictionary to obtain the phoneme transcription, and finally use the phoneme-based standard pronunciation model to build a linear standard pronunciation network whose nodes are HMM states.
D. Speech endpoint detection:
(1) convert the analog speech signal into digital speech by A/D conversion;
(2) apply pre-emphasis and frame-splitting with windowing to the digital speech, obtaining framed speech that is approximately stationary;
(3) compute the time-domain log energy of the framed speech;
(4) obtain the feature used for endpoint detection (hereinafter the endpoint feature) from the time-domain log energy with a moving-average filter;
(5) perform endpoint detection on the endpoint feature by combining upper and lower dual thresholds with a finite state machine, obtaining the start and end points of the speech.
E. Speech feature extraction for evaluation:
extract speech features from the framed speech of step D; the process is identical to step A(3).
F. Optimal path search:
(1) force-align the speech features of step E against the standard pronunciation network of step C, obtaining all possible path information in the network;
(2) using this path information, backtrack from the admissible terminal node of the network to obtain the optimal path.
G. Pronunciation quality score calculation:
(1) use the optimal path information of step F to compute the confidence score of each frame of speech features;
(2) use the optimal path information of step F to compute the confidence score of each state on the path, and average the confidence scores of all states on the optimal path to obtain the whole-sentence confidence score;
(3) map the whole-sentence confidence score into the subjective scoring interval with a mapping function, obtaining the final pronunciation quality score.
The cepstral coefficients in step A may be Mel-frequency cepstral coefficients (MFCC, Mel-Frequency Cepstrum Coefficients), which exploit the frequency-resolution characteristics of the human ear.
The standard pronunciation model in step B(1) is a phoneme-based hidden Markov model (HMM, Hidden Markov Model). Its training process is: initialize a single Gaussian model from all the speech features; copy this model to initialize all phoneme models; train the models repeatedly by the Baum-Welch method; then progressively increase the number of Gaussian components in each phoneme model and perform Baum-Welch training again.
The adaptation of the standard pronunciation model to the accent of Chinese speakers in step B(2) is implemented by applying accent adaptation based on maximum likelihood linear regression (MLLR, Maximum Likelihood Linear Regression) and maximum a posteriori (MAP, Maximum A Posteriori) methods to the trained standard pronunciation model, obtaining the final standard pronunciation model.
The standard pronunciation network of step C may be a linear network with definite start and terminal nodes, whose nodes are HMM states, without a grammar: each node depends only on its predecessor nodes.
The optimal path search of step F uses frame-synchronous Viterbi beam search.
To realize the pronunciation quality evaluation method of the invention on the limited memory of an embedded system, steps D, E, F, and G are carried out in temporal segments of a predefined fixed number of frames. This greatly reduces the demand on system resources and allows the embedded learning system to handle long utterances.
The pronunciation quality evaluation method of the invention gives the language learning machine an interactive capability. An embedded English learning system implemented with this method has achieved good performance in practical use.
The invention has the following features:
(1) high evaluation accuracy, good robustness, and low system resource overhead;
(2) because the standard pronunciation model is phoneme-based, the embedded learning system can change courseware content easily without retraining;
(3) the influence of the mother-tongue accent is considered: the phoneme models are accent-adapted to the English pronunciation of Chinese speakers;
(4) real-time endpoint detection using a moving-average filter and a finite state machine improves the accuracy and robustness of endpoint detection for English speech;
(5) the method can be used in DSP-based embedded language learning systems, which are small, lightweight, low-power, and low-cost;
(6) combined with rich courseware formats, the method can change the traditional learning-machine mode of operation and the classroom teaching pattern.
Description of drawings
Fig. 1 is an overall flow diagram of the method of the embodiment of the invention.
Fig. 2 is a flow chart of standard pronunciation model training in the embodiment; Fig. 2(a) shows the overall training process, and Fig. 2(b) shows the training process of one specific hidden Markov model.
Fig. 3 shows the topology of the standard pronunciation model of the embodiment; Fig. 3(a) shows the pause model, and Fig. 3(b) shows the phoneme and silence models.
Fig. 4 is a flow chart of hidden Markov model accent adaptation in the embodiment.
Fig. 5 shows the topology of the standard pronunciation network of the embodiment; Fig. 5(a) shows the whole sentence as a linear network with words as nodes, and Fig. 5(b) shows each word as a linear network with phonemes as nodes.
Fig. 6 illustrates the generation process of the standard pronunciation network of the embodiment.
Fig. 7 is a detailed flow chart of the pronunciation quality evaluation method of the embodiment as realized on an embedded platform.
Embodiment
An embodiment of the pronunciation quality evaluation method for a language learning machine proposed by the invention is described in detail with reference to the figures:
As shown in Fig. 1, the overall flow of the embodiment is divided into: A, speech feature extraction for training; B, standard pronunciation model training; C, standard pronunciation network generation (these steps can be completed in advance on a computer); D, speech endpoint detection; E, speech feature extraction for evaluation; F, optimal path search; G, pronunciation quality score calculation and output (these steps are completed on the embedded platform). The implementation of each step is described in detail below.
A, the phonetic feature that is used to train extract:
(1) Build in advance a training database containing a large amount of read-aloud English speech (the content is required to give each phoneme a certain amount of coverage);
(2) Apply pre-emphasis to the digital speech in each file of the training database, with the pre-emphasis filter taken as H(z) = 1 - 0.9375z^-1. Then split the pre-emphasized speech into frames and apply a Hamming window; the frame length may be 32 ms with a frame shift of 16 ms, yielding framed speech that is approximately stationary;
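The pre-emphasis and framing of this step can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the 8 kHz sampling rate of step D (so 32 ms = 256 samples and 16 ms = 128 samples), and the function names are the editor's own.

```python
import numpy as np

def preemphasize(x, alpha=0.9375):
    # pre-emphasis H(z) = 1 - alpha * z^-1, with the document's alpha = 0.9375
    y = np.asarray(x, dtype=float).copy()
    y[1:] -= alpha * y[:-1]
    return y

def frame_signal(x, frame_len=256, frame_shift=128):
    # split into overlapping frames (32 ms / 16 ms at 8 kHz) and apply
    # a Hamming window to each frame, yielding quasi-stationary frames
    n_frames = 1 + (len(x) - frame_len) // frame_shift
    window = np.hamming(frame_len)
    return np.stack([x[i * frame_shift : i * frame_shift + frame_len] * window
                     for i in range(n_frames)])
```

A 1024-sample buffer thus yields 7 windowed frames of 256 samples each.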
(3) Extract Mel-frequency cepstral coefficients (MFCC) from the framed speech as speech features. The short-time spectral characteristics of speech accurately describe its variation, and the MFCC is a feature vector computed according to the frequency-resolution properties of the human auditory system, built on Fourier spectral analysis. The MFCCs are computed as follows: first, apply a fast Fourier transform (FFT, Fast Fourier Transformation) to each frame to obtain the short-time spectrum of the signal; second, divide the short-time spectrum into several band-pass groups according to the Mel scale, each band-pass filter having a triangular frequency response; next, compute the signal energy of each filter in the bank; finally, compute the corresponding cepstral coefficients by a discrete cosine transform;
The MFCC features mainly reflect the static characteristics of speech; the dynamic characteristics of the speech signal can be described by the first- and second-order differences of the static feature sequence. The complete speech feature consists of the MFCC parameters, their first- and second-order difference coefficients, and the normalized energy coefficient with its first- and second-order difference coefficients, for a total of 39 dimensions per frame;
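The assembly of the 39-dimensional vector (12 MFCCs plus normalized energy, with first- and second-order differences) can be sketched as below. The regression window width k = 2 and the max-based energy normalization are assumptions; the patent does not specify them.

```python
import numpy as np

def delta(feat, k=2):
    # regression-based difference over +/-k frames (window width assumed)
    T = len(feat)
    denom = 2 * sum(i * i for i in range(1, k + 1))
    padded = np.pad(feat, ((k, k), (0, 0)), mode='edge')
    return np.stack([sum(i * (padded[t + k + i] - padded[t + k - i])
                         for i in range(1, k + 1)) / denom
                     for t in range(T)])

def full_feature(mfcc, log_energy):
    # 12 MFCCs + normalized energy = 13 static dims; adding first- and
    # second-order differences of all 13 gives 39 dims per frame
    e = (log_energy - log_energy.max()).reshape(-1, 1)  # normalization assumed
    static = np.hstack([mfcc, e])
    d1 = delta(static)
    d2 = delta(d1)
    return np.hstack([static, d1, d2])
```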
The training of B, Received Pronunciation model:
(1) The process of training the phoneme-based standard pronunciation model with the speech features of step A, shown in Fig. 2, is:
a. Build a prototype multidimensional Gaussian distribution for a single data stream, with a diagonal covariance matrix and dimensionality equal to that of the speech features; estimate its mean vector and covariance matrix from all the speech data.
b. Determine the pronunciation dictionary and phonetic symbol system and complete the phoneme-level transcription of all speech. The phonetic symbol system of this embodiment comprises 40 phonemes, one silence symbol, and one pause symbol.
c. This embodiment uses phoneme-based hidden Markov models (HMM) as the standard pronunciation model; the HMM is the statistical speech recognition model most widely adopted at present, and its left-to-right state transition structure describes the pronunciation characteristics of speech well. The phoneme and silence models adopted by the invention are 3-state HMMs, and the pause model is a single-state HMM that can be skipped; their topologies are shown in Fig. 3, where q_i denotes a state of the HMM, a_ij denotes a transition probability of the HMM, and b_j(o_t) is the multi-stream mixture-Gaussian output probability density of state j, given by formula (1):
b_j(o_t) = prod_{s=1..S} [ sum_{m=1..M_s} c_jsm * N(o_st; mu_jsm, Sigma_jsm) ]    (1)
where S is the number of data streams, M_s is the number of mixture Gaussian components in each data stream, c_jsm are the mixture weights, and N is the multidimensional Gaussian distribution of formula (2):
N(o; mu, Sigma) = (2*pi)^(-n/2) * |Sigma|^(-1/2) * exp(-(o - mu)' Sigma^(-1) (o - mu) / 2)    (2)
The standard pronunciation model of this embodiment comprises 40 phoneme HMMs, one silence HMM, and one pause HMM. The Gaussian prototype is copied into each HMM model, and each model is then re-estimated several times with the Baum-Welch algorithm; the number of re-estimations may be 5.
d. Progressively increase the number of Gaussian components in the HMM models and train again with Baum-Welch; the number of Gaussian components is increased successively to 2, 4, 6, and 8. When the number of Gaussians reaches 8, training is repeated 10 times and the training process ends.
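The mixture-growing schedule of step d can be sketched as follows. The patent does not give the splitting rule, so this HTK-style step (split the heaviest component into two perturbed copies with halved weight, perturbation size eps) is an assumption; means are scalars for brevity, and the Baum-Welch re-estimation between growth steps is left as a placeholder.

```python
def mix_up(means, weights, target, eps=0.2):
    # grow the mixture to `target` components by splitting the heaviest one
    means, weights = list(means), list(weights)
    while len(means) < target:
        i = max(range(len(weights)), key=weights.__getitem__)
        m, w = means[i], weights[i]
        means[i], weights[i] = m - eps, w / 2.0  # perturbed copy, half weight
        means.append(m + eps)
        weights.append(w / 2.0)
    return means, weights

def training_schedule(schedule=(2, 4, 6, 8)):
    # the embodiment's schedule: grow to 2, 4, 6, 8 components,
    # re-running Baum-Welch after each growth step (training itself omitted)
    means, weights = [0.0], [1.0]
    for target in schedule:
        means, weights = mix_up(means, weights, target)
        # ... Baum-Welch re-estimation would run here ...
    return means, weights
```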
(2) For the adaptation of the standard pronunciation model to the accent of Chinese speakers, the embodiment adopts accent adaptation based on global MLLR followed by MAP in series, with the number of adaptation iterations set to 4; the flow is shown in Fig. 4.
a. MLLR is an adaptation algorithm based on model transformation. Its basic assumption is that speech sounds that are close in the speaker-independent model space are also close under the transformation into the target speaker's space, so the transformation can be estimated from the sounds present in the adaptation data and then applied to map the models of unseen sounds from the speaker-independent space into the target space, achieving adaptation. The model space is divided into R classes according to some metric (such as Euclidean distance or likelihood); the transformation of class r is T_r(.), its adaptation utterance set is X_r, and its model parameters are lambda_r, r = 1, 2, ..., R. Adaptive training then satisfies formula (3):
T_r = argmax_T P(X_r | T(lambda_r)),  r = 1, 2, ..., R    (3)
and the adapted parameters satisfy formula (4):
lambda_r' = T_r(lambda_r)    (4)
Because this class of algorithm fully exploits the relationships among sounds, several models share one transformation and only the transformation coefficients need to be estimated; data to estimate them accumulates easily, so the method takes effect with little adaptation data and adapts quickly. This embodiment uses unclassified global MLLR adaptation.
b. The basic criterion of the MAP algorithm is maximization of the posterior probability, so it is theoretically optimal, as in formula (5):
lambda_MAP = argmax_lambda P(lambda | X) = argmax_lambda P(X | lambda) * P(lambda)    (5)
The mean-vector estimate of the standard MAP algorithm is formula (6):
mu' = (tau * mu + sum_t L_t * o_t) / (tau + sum_t L_t)    (6)
where L_t is the posterior probability of the observation vector at time t for this Gaussian mixture component, tau is the prior weight relative to the adaptation data, o_t is the adaptation observation at time t, and mu is the mean vector of the speaker-independent model. It can be seen that when the adaptation data are abundant, the adapted mean vector mu' tends toward the speaker-dependent mean of the adaptation data. The embodiment applies MAP after MLLR in order to exploit the adaptation speech fully and further improve the accent adaptation effect.
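The MAP mean update of formula (6) is a one-liner in substance; a sketch (function name and default tau are the editor's, not the patent's):

```python
import numpy as np

def map_mean(prior_mean, frames, occupancy, tau=10.0):
    # mu_hat = (tau * mu_prior + sum_t L_t * o_t) / (tau + sum_t L_t)
    # `occupancy` holds the frame posteriors L_t; tau weighs the prior
    L = np.asarray(occupancy, dtype=float)
    O = np.asarray(frames, dtype=float)
    mu0 = np.asarray(prior_mean, dtype=float)
    return (tau * mu0 + (L[:, None] * O).sum(axis=0)) / (tau + L.sum())
```

With little data the estimate stays near the prior mean; with abundant data it approaches the sample mean of the adaptation frames, matching the behavior described above.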
The final standard pronunciation model is stored in the external memory of the embedded system.
The generation of C, Received Pronunciation network:
The standard pronunciation network of the embodiment is shown in Fig. 5, where (a) is a linear network example with words as nodes, the start node being the initial "sil" and the terminal node the final "sil", and (b) shows the interior of each word as a linear network with phonemes as nodes; the interior of each phoneme is a network with states as nodes, as in Fig. 3. The network generation process is shown in Fig. 6: first the original text is segmented into words, giving Fig. 5(a); then each word is looked up in the pronunciation dictionary, giving Fig. 5(b). To handle words with multiple pronunciations while saving storage space and improving search efficiency, the embodiment aligns the phoneme strings of the alternative pronunciations of a word by dynamic programming and fuses the multiple phoneme sequences into a single phoneme-node network in which identical phonemes are shared across pronunciations. Finally, the phoneme HMM models are used to expand the network into a network with states as nodes; each state node records its state identifier, phoneme identifier, word identifier, the number of its predecessor nodes, and their identifiers. The result is the standard pronunciation network of the embodiment: a grammar-free network with a definite start node P and terminal node T whose nodes are HMM states, each node depending only on its predecessor nodes.
This standard pronunciation network is stored in the external memory of the embedded system.
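The expansion from text to a linear state network can be sketched as follows. The lexicon contents are hypothetical, and the multiple-pronunciation fusion by dynamic programming is omitted for brevity; each node records the fields named in the text, with a single predecessor index since the network is linear.

```python
def build_network(text, lexicon, states_per_phone=3):
    # sil at both ends, each word replaced by its phoneme string, each
    # phoneme expanded into 3 HMM states (Fig. 3 topology)
    nodes = []
    def add_phone(phone, word):
        for s in range(states_per_phone):
            nodes.append({'state': s, 'phone': phone, 'word': word,
                          'pred': len(nodes) - 1})  # linear: one predecessor
    add_phone('sil', None)
    for word in text.lower().split():
        for phone in lexicon[word]:
            add_phone(phone, word)
    add_phone('sil', None)
    return nodes
```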
D, sound end detect:
(1) The speech signal is first low-pass filtered, then sampled and quantized by a 16-bit linear A/D converter to become digital speech; the sampling frequency is 8 kHz;
(2) Apply pre-emphasis and frame-splitting with windowing to the digital speech, obtaining framed speech that is approximately stationary; the method is identical to step A(2);
(3) Compute the short-time log energy of the framed speech.
(4) adopt the method for moving average filter to obtain end inspection feature by said time domain logarithm energy: end-point detection is carried out in real time, and the real time end point detecting method need satisfy following requirement: a, different background-noise levels is had consistent output; B, can detect starting point and terminating point; C, the time-delay of weak point; D, limited response interval; E, in end points place maximization signal to noise ratio (S/N ratio); The end points of f, accurate detection and localization; G, suppress to detect mistake to greatest extent; It is closely similar to take all factors into consideration the graphic limit detection function (moving average filter) that adopts usually in the objective function of above requirements definition and the Flame Image Process.Said moving average filter as the formula (7), wherein g () is a time domain logarithm energy, t is current frame number, and h () is a moving average filter, as the formula (8), as seen h () is an odd symmetry function, W is desirable 13, f () as the formula (9), its parameter can be: A=0.2208, s=0.5383, [K
1... K
6]=[1.583,1.468 ,-0.078 ,-0.036 ,-0.872 ,-0.56].
f(x)=e
Ax[K
1sin(Ax)+K
2cos(Ax)]+e
-Ax[K
3sin(Ax)+K
4cos(Ax)]+K
5+K
6e
sx(9)
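The filter and the endpoint feature can be sketched as below. The constants come from the text; sampling f on [0, 1] (i.e., h(n) = f(n/W)) to obtain the taps is an assumed normalization, though with these constants f(0) = 0, which is consistent with an odd-symmetric filter.

```python
import math

A, S_PARAM = 0.2208, 0.5383
K = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]

def f(x):
    # formula (9) with the constants listed in the text; note f(0) = 0
    return (math.exp(A * x) * (K[0] * math.sin(A * x) + K[1] * math.cos(A * x))
            + math.exp(-A * x) * (K[2] * math.sin(A * x) + K[3] * math.cos(A * x))
            + K[4] + K[5] * math.exp(S_PARAM * x))

def filter_taps(W=13):
    # odd-symmetric taps: h(n) = f(n/W) for n >= 0, h(-n) = -h(n) (assumed)
    pos = [f(n / W) for n in range(W + 1)]
    return [-v for v in reversed(pos[1:])] + pos

def endpoint_feature(log_energy, W=13):
    # F(t) = sum_n h(n) * g(t+n); frame indices are clamped at the edges
    h = filter_taps(W)
    g = log_energy
    T = len(g)
    return [sum(h[n + W] * g[min(max(t + n, 0), T - 1)]
                for n in range(-W, W + 1))
            for t in range(T)]
```

On a log-energy step up (silence into speech) the feature is positive, and in flat silence it is near zero, matching the behavior described in step (5).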
(5) Perform endpoint detection on the endpoint feature by combining upper and lower dual thresholds with a finite state machine, obtaining the start and end points of the speech. The endpoint feature F(t) is positive at the start of speech, negative at the end of speech, and close to zero during silence. According to the preset upper and lower thresholds and the minimum speech duration, each frame drives transitions among the speech, silence, and leaving-speech states. The machine starts in the silence state; when F(t) reaches the upper threshold, it outputs the start point of the speech and enters the speech state. In the speech state, when F(t) reaches the lower threshold, it enters the leaving-speech state. When the time spent in the leaving-speech state reaches a preset threshold, it outputs the end point of the speech, closes the recording channel, and endpoint detection ends.
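The dual-threshold finite state machine of this step can be sketched as follows. The threshold values and the leaving-state duration are illustrative presets; the patent tunes them empirically.

```python
def detect_endpoints(F, upper=0.5, lower=-0.5, hangover=5):
    # silence -> speech when F reaches the upper threshold;
    # speech -> leaving when F reaches the lower threshold;
    # the end point is confirmed after `hangover` frames in the leaving state
    state, start, left = 'silence', None, 0
    for t, v in enumerate(F):
        if state == 'silence':
            if v >= upper:
                state, start = 'speech', t
        elif state == 'speech':
            if v <= lower:
                state, left = 'leaving', 0
        else:  # leaving
            left += 1
            if v >= upper:          # speech resumed
                state = 'speech'
            elif left >= hangover:  # end point confirmed, channel closed
                return start, t
    if start is None:
        return None, None
    return start, len(F) - 1
```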
E, the phonetic feature that is used to estimate extract:
Extract speech features from the framed speech of step D; the process is identical to step A(3).
F, optimum route search:
(1) Force-align the speech features of step E against the standard pronunciation network of step C, obtaining all possible path information in the network. The standard pronunciation network of the embodiment is a left-to-right linear network (Fig. 5), so the frame-synchronous Viterbi beam search algorithm can be used to find the optimal path. Given the HMM model Phi and the observation vector sequence O = {o_1, ..., o_T}, we seek the optimal state sequence S = {s_1, ..., s_T} that produces this observation sequence, namely formula (10):
S = argmax_S P(O, S | Phi)    (10)
In the Viterbi algorithm, the optimal path likelihood at time t is defined as formula (11):
V_i(t) = P(o_1, ..., o_t, s_1, ..., s_{t-1}, s_t = i | Phi)    (11)
In a linear network, the optimal path at any time depends only on the information of the current frame and the previous frame; that is, it satisfies the no-aftereffect (Markov) property. Therefore, if the globally optimal path passes through node i at time t, then its portion between times 0 and t must be optimal among all paths ending at node i at time t. If only the optimal path is wanted, it suffices to keep a single path ending at each node i at each time t.
Following this principle, the search algorithm of the embodiment is:
Definitions: PreNode(i) is the set of predecessor nodes of node i; BestPre(t, i) is the optimal predecessor node of node i at time t; L(t, i) is the likelihood score of node i for the speech frame at time t; L_Path(1, i) and L_Path(0, i) are the likelihood scores of the optimal paths ending at node i for the previous frame and the current frame, respectively.
Step 1: at time t = 0, as in formula (12):
L_Path(0, i) = L(0, i) if i is a start node (i in Entry), and -infinity otherwise    (12)
Step 2: at time t, obtain the current-frame likelihood score L(t, i) for every node i; the optimal path score of the current frame is then formula (13):
L_Path(0, i) = L(t, i) + max_{j in PreNode(i)} L_Path(1, j)    (13)
Record the optimal predecessor node in BestPre(t, i), and swap the data of L_Path(1, i) and L_Path(0, i) in preparation for the next frame.
Step 3: if t < T, go to Step 2; otherwise, finish.
(2) When the speech ends, the optimal state path of the forced alignment is obtained by backtracking through BestPre(t, i) from the admissible terminal node of the network.
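The search and backtracking steps above can be sketched as follows. This is a minimal frame-synchronous Viterbi over a linear network (no beam pruning, which the embodiment adds); node scores are passed in precomputed, and self-loops are represented by listing a node among its own predecessors.

```python
import math

def viterbi_linear(frame_scores, pre_nodes, entry, exit_nodes):
    # frame_scores[t][i] is L(t, i); pre_nodes[i] lists PreNode(i),
    # including node i itself for the self-loop transition
    T, N = len(frame_scores), len(frame_scores[0])
    NEG = -math.inf
    prev = [frame_scores[0][i] if i in entry else NEG for i in range(N)]  # (12)
    best_pre = [[-1] * N for _ in range(T)]
    for t in range(1, T):
        cur = [NEG] * N
        for i in range(N):
            j = max(pre_nodes[i], key=lambda p: prev[p])  # best predecessor
            if prev[j] > NEG:
                cur[i] = frame_scores[t][i] + prev[j]     # (13)
                best_pre[t][i] = j
        prev = cur
    # backtrack through BestPre from the best admissible terminal node
    i = max(exit_nodes, key=lambda n: prev[n])
    path = [i]
    for t in range(T - 1, 0, -1):
        i = best_pre[t][i]
        path.append(i)
    return path[::-1]
```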
The calculating of G, voice quality mark
(1) Use the optimal path information of step F to compute the confidence score of each frame of speech features, as in formula (14).
(2) Use the optimal path information of step F to compute the confidence score of each state on the path; average the confidence scores of all states on the optimal path to obtain the whole-sentence confidence score, as in formula (15):
C = (1/N) * sum_{i=1..N} c_i    (15)
where N is the number of states on the optimal path and c_i is the confidence score of the i-th state.
(3) Map the whole-sentence confidence score into the subjective scoring interval with a mapping function. The interval of the raw confidence score is usually (-infinity, a], where a is a constant, which does not match the subjective scoring interval; the embodiment maps it to the subjective scoring interval with the piecewise linear function of formula (16), whose parameters a and b are determined by experiment, a being a regulatory factor.
The resulting score S can further be quantized into pronunciation quality grades of excellent, good, medium, and poor.
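One possible shape for the piecewise-linear mapping and the grade quantization is sketched below. The patent determines its a and b experimentally and does not publish them, so the cut points and the 0-100 target scale here are placeholders, not the patent's values.

```python
def map_score(c, a=-2.0, b=-6.0):
    # two-segment piecewise-linear map onto a 0-100 scale (placeholders):
    # confidences above `a` saturate at the top, below `b` at the bottom
    if c >= a:
        return 100.0
    if c <= b:
        return 0.0
    return 100.0 * (c - b) / (a - b)

def grade(score):
    # quantize into excellent / good / medium / poor (cut points assumed)
    for cut, g in ((85, 'excellent'), (70, 'good'), (55, 'medium')):
        if score >= cut:
            return g
    return 'poor'
```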
Considering the memory constraints, steps D, E, F, and G of the embodiment are carried out in temporal segments of a predefined fixed number of frames; each segment may be 40 frames.
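The fixed-step segmentation amounts to walking the frame sequence in 40-frame windows, so steps D-G never hold more than one segment's worth of state at a time; a trivial sketch:

```python
def chunks(n_frames, step=40):
    # (start, end) index pairs for fixed 40-frame segments; the last
    # segment may be shorter
    return [(s, min(s + step, n_frames)) for s in range(0, n_frames, step)]
```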
Based on the above method, the embodiment implements an embedded English learning system with pronunciation quality evaluation. The learning content can be updated automatically at any time according to teaching requirements. The pronunciation quality evaluation technology enables interactive human-machine learning, greatly reduces the workload of classroom spoken-English teaching, relieves the shortage of teachers, and enables autonomous learning and automatic testing of spoken English. The invention can evaluate the English pronunciation quality of standard-Mandarin-speaking Chinese learners; with a 4-level grading scale (excellent, good, medium, poor), the correlation of the method's evaluation of Chinese speakers' English pronunciation quality with subjective evaluation reaches 0.74.
Claims (5)
1, a kind of pronunciation quality evaluating method that is used for language learner, comprise that the phonetic feature that is used to train extracts, the Received Pronunciation model training, the generation of Received Pronunciation network, sound end detects, the phonetic feature that is used to estimate extracts, optimum route search, and the calculating each several part of voice quality mark; It is characterized in that the implementation method of each several part specifically may further comprise the steps:
A, the phonetic feature that is used to train extract:
(1) foundation comprises the tranining database of reading aloud voice in a large number in advance;
(2) performing pre-emphasis, framing, and windowing on the digital speech in each voice file of said training database to obtain quasi-stationary framed speech;
(3) extracting speech features from said framed speech, the speech features being cepstral coefficients;
B. Training of the standard pronunciation model:
(1) training a phoneme-based standard pronunciation model with the speech features of step A;
(2) adapting said standard pronunciation model to the accent of Chinese speakers to obtain the final standard pronunciation model, thereby optimizing its evaluation performance for Chinese learners;
C. Generation of the standard pronunciation network:
segmenting the given text into words, looking up a pronunciation dictionary to obtain its phoneme transcription, and then using said phoneme-based standard pronunciation model to build a linear standard pronunciation network whose nodes are model states;
D. Speech endpoint detection:
(1) converting the analog speech signal into digital speech through A/D conversion;
(2) performing pre-emphasis, framing, and windowing on said digital speech to obtain quasi-stationary framed speech;
(3) computing the time-domain log energy of said framed speech;
(4) applying a moving-average filter to said time-domain log energy to obtain the feature used for endpoint detection;
(5) applying upper and lower dual thresholds combined with a finite state machine to said endpoint-detection feature to obtain the start and end points of the speech;
E. Extraction of the speech features used for evaluation:
extracting speech features from the framed speech of step D, with the same process as step A(3);
F. Optimal path search:
(1) force-aligning the speech features of step E against the standard pronunciation network of step C to obtain all possible path information in the network;
(2) using said path information, backtracking from the permitted terminal nodes of the network to obtain the optimal path;
G. Calculation of the pronunciation quality score:
(1) computing a confidence score for each frame of speech features from the optimal path information of step F;
(2) computing the confidence score of each state on the optimal path from said path information, and averaging the confidence scores of all states on the optimal path to obtain the whole-sentence confidence score;
(3) mapping said whole-sentence confidence score into the subjective assessment score interval with a mapping function to obtain the final pronunciation quality score.
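The endpoint detection of step D can be sketched as a smoothed-energy, dual-threshold state machine. This is an illustrative sketch only: the filter length and the threshold values below are assumptions, not values disclosed in the patent.

```python
import numpy as np

def smooth(energy, k=5):
    """Moving-average filter over per-frame (log) energies, as in step D(4)."""
    return np.convolve(energy, np.ones(k) / k, mode="same")

def detect_endpoints(energy, low, high):
    """Dual-threshold finite state machine (step D(5)): enter the speech
    state when the smoothed energy crosses `high`, and close the segment
    when it falls back below `low`."""
    state, start, end = "silence", None, None
    for i, v in enumerate(energy):
        if state == "silence" and v > high:
            state, start = "speech", i
        elif state == "speech" and v < low:
            end = i
            break
    return start, end

# Synthetic energy track: 10 silent frames, 20 speech frames, 10 silent frames.
e = smooth(np.array([0.0] * 10 + [5.0] * 20 + [0.0] * 10))
print(detect_endpoints(e, low=1.5, high=2.5))  # (10, 31)
```

The hysteresis between the two thresholds keeps brief energy dips inside a word from splitting it into two segments.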
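The score calculation of step G reduces to averaging frame confidences within each state, averaging the state scores over the sentence, and passing the result through a mapping function onto the subjective scale. The sigmoid form and its `midpoint`/`slope` parameters below are assumptions for illustration; the patent does not specify the mapping function.

```python
import math

def sentence_confidence(state_frame_scores):
    """Average the frame confidences within each state (step G(2)), then
    average the per-state scores to obtain the whole-sentence confidence."""
    state_means = [sum(frames) / len(frames) for frames in state_frame_scores]
    return sum(state_means) / len(state_means)

def map_to_subjective(conf, midpoint=-3.0, slope=1.5, top=100.0):
    """Map a raw (log-domain) confidence onto a 0-100 subjective scale
    with a sigmoid (step G(3)); midpoint and slope are tuning assumptions."""
    return top / (1.0 + math.exp(-slope * (conf - midpoint)))

# Two states: one with two frames, one with a single frame.
score = map_to_subjective(sentence_confidence([[-2.0, -4.0], [-3.0]]))
print(round(score, 1))  # 50.0
```

In practice the mapping parameters would be fitted against human rater scores so the machine score tracks subjective judgments.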
2. The pronunciation quality evaluation method for a language learning machine of claim 1, wherein the cepstral coefficients of step A are Mel-frequency cepstral coefficients, which exploit the frequency resolution characteristics of the human ear.
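The Mel scale referenced in claim 2 warps frequency so that filterbank channels are spaced like the ear's resolution. A minimal sketch, using the common 2595·log10 formula (the patent does not state which Mel approximation it uses); filter count and band edges are illustrative assumptions.

```python
import numpy as np

def hz_to_mel(f):
    """The Mel scale approximates the frequency resolution of the human ear."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_centers(n_filters=26, fmin=0.0, fmax=8000.0):
    """Center frequencies of a triangular filterbank, equally spaced on the
    Mel scale and therefore denser at low frequencies, where the ear
    discriminates best."""
    mels = np.linspace(hz_to_mel(fmin), hz_to_mel(fmax), n_filters + 2)
    return mel_to_hz(mels)[1:-1]

centers = mel_filter_centers()
```

MFCCs are then the discrete cosine transform of the log energies of such a filterbank applied to each frame's spectrum.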
3. The pronunciation quality evaluation method for a language learning machine of claim 1, wherein the standard pronunciation model of step B(1) is a phoneme-based hidden Markov model, trained as follows: a single Gaussian model is initialized from all the speech features; this model is copied to initialize every phoneme model; the models are iteratively trained with the Baum-Welch method; and the number of Gaussian components of each phoneme model is repeatedly increased, with Baum-Welch training repeated after each increase.
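The mixture-growing step in claim 3 is commonly implemented by splitting each Gaussian in two, HTK-style. A sketch under that assumption (the perturbation factor `eps` is illustrative); after each split, the claim's Baum-Welch re-estimation would be re-run, which is omitted here.

```python
import numpy as np

def split_gaussians(means, covs, weights, eps=0.2):
    """Double the number of mixture components: perturb each mean by
    +/- eps * sqrt(var), duplicate the (diagonal) covariances, and halve
    the weights so they still sum to one."""
    sd = np.sqrt(covs)
    new_means = np.concatenate([means + eps * sd, means - eps * sd])
    new_covs = np.concatenate([covs, covs])
    new_weights = np.concatenate([weights / 2.0, weights / 2.0])
    return new_means, new_covs, new_weights

# One 2-dimensional Gaussian grows into a 2-component mixture.
m, c, w = split_gaussians(np.array([[0.0, 1.0]]),
                          np.array([[1.0, 4.0]]),
                          np.array([1.0]))
```

Repeating split-then-retrain lets the model complexity grow gradually, which is more stable than training a large mixture from scratch.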
4. The pronunciation quality evaluation method for a language learning machine of claim 1, wherein the accent adaptation of step B(2) is implemented by applying maximum likelihood linear regression (MLLR) and maximum a posteriori (MAP) accent adaptation to the trained standard pronunciation model to obtain the final standard pronunciation model.
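Claim 4 names two adaptation techniques; as one illustrative piece, the MAP update of a Gaussian mean interpolates between the prior (standard-accent) mean and the accented adaptation data. A sketch of that single update (the prior weight `tau` is an assumption; MLLR, and the variance/weight updates, are omitted).

```python
import numpy as np

def map_adapt_mean(prior_mean, frames, gammas, tau=10.0):
    """MAP re-estimation of a Gaussian mean:
        mu = (tau * mu0 + sum_t gamma_t * x_t) / (tau + sum_t gamma_t)
    where gamma_t is the occupation probability of frame x_t. With little
    data the prior dominates; with much data the estimate follows the data."""
    g = np.asarray(gammas)[:, None]
    num = tau * prior_mean + (g * np.asarray(frames)).sum(axis=0)
    return num / (tau + g.sum())

# Prior mean 0, two accented frames at 1.0 with full occupancy, tau=2:
mu = map_adapt_mean(np.array([0.0]), np.array([[1.0], [1.0]]), [1.0, 1.0], tau=2.0)
```

This data-count-dependent interpolation is what makes MAP adaptation robust when some phonemes are rare in the accented corpus.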
5. The pronunciation quality evaluation method for a language learning machine of claim 1, wherein the standard pronunciation network of step C is a linear network with a definite start node and terminal node, whose nodes are the states of a hidden Markov model, in which no grammar is considered and each node depends only on its predecessor node.
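The linear network of claim 5 can be sketched as a simple chain built from the phoneme transcription, with one HMM-state node per position and an edge only from each node to its successor. The state count per phoneme and the node naming below are illustrative assumptions.

```python
def build_linear_network(phones, states_per_phone=3):
    """Expand a phoneme sequence into a linear chain of HMM-state nodes
    with a single start node and a single terminal node; each node is
    reachable only from its predecessor, so no grammar is involved."""
    nodes = ["<start>"]
    for p in phones:
        nodes += [f"{p}_{s}" for s in range(states_per_phone)]
    nodes.append("<end>")
    # One edge per consecutive pair: a strictly linear network.
    edges = list(zip(nodes[:-1], nodes[1:]))
    return nodes, edges

nodes, edges = build_linear_network(["h", "ae"])
```

Forced alignment (step F) then amounts to matching the feature frames against this fixed chain, which is far cheaper than decoding against a full grammar.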
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2005101148488A CN100411011C (en) | 2005-11-18 | 2005-11-18 | Pronunciation quality evaluating method for language learning machine |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1763843A true CN1763843A (en) | 2006-04-26 |
CN100411011C CN100411011C (en) | 2008-08-13 |
Family
ID=36747941
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNB2005101148488A Expired - Fee Related CN100411011C (en) | 2005-11-18 | 2005-11-18 | Pronunciation quality evaluating method for language learning machine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN100411011C (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1236928A (en) * | 1998-05-25 | 1999-12-01 | 郭巧 | Computer aided Chinese intelligent education system and its implementation method |
CN1123863C (en) * | 2000-11-10 | 2003-10-08 | 清华大学 | Information check method based on speed recognition |
Cited By (41)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101105894B (en) * | 2006-07-12 | 2011-08-10 | 陈修志 | Multifunctional language learning machine |
WO2009097738A1 (en) * | 2008-01-30 | 2009-08-13 | Institute Of Computing Technology, Chinese Academy Of Sciences | Method and system for audio matching |
CN101246685B (en) * | 2008-03-17 | 2011-03-30 | 清华大学 | Pronunciation quality evaluation method of computer auxiliary language learning system |
CN101739868B (en) * | 2008-11-19 | 2012-03-28 | 中国科学院自动化研究所 | Automatic evaluation and diagnosis method of text reading level for oral test |
CN101826263B (en) * | 2009-03-04 | 2012-01-04 | 中国科学院自动化研究所 | Objective standard based automatic oral evaluation system |
US8385221B2 (en) | 2010-02-28 | 2013-02-26 | International Business Machines Corporation | System and method for monitoring of user quality-of-experience on a wireless network |
CN102237086A (en) * | 2010-04-28 | 2011-11-09 | 三星电子株式会社 | Compensation device and method for voice recognition equipment |
CN101996635A (en) * | 2010-08-30 | 2011-03-30 | 清华大学 | English pronunciation quality evaluation method based on accent highlight degree |
KR101417975B1 (en) * | 2010-10-29 | 2014-07-09 | 안후이 유에스티씨 아이플라이텍 캄파니 리미티드 | Method and system for endpoint automatic detection of audio record |
US9330667B2 (en) | 2010-10-29 | 2016-05-03 | Iflytek Co., Ltd. | Method and system for endpoint automatic detection of audio record |
WO2012055113A1 (en) * | 2010-10-29 | 2012-05-03 | 安徽科大讯飞信息科技股份有限公司 | Method and system for endpoint automatic detection of audio record |
CN102253976B (en) * | 2011-06-17 | 2013-05-15 | 苏州思必驰信息科技有限公司 | Metadata processing method and system for spoken language learning |
CN102253976A (en) * | 2011-06-17 | 2011-11-23 | 苏州思必驰信息科技有限公司 | Metadata processing method and system for spoken language learning |
CN102568475A (en) * | 2011-12-31 | 2012-07-11 | 安徽科大讯飞信息科技股份有限公司 | System and method for assessing proficiency in Putonghua |
CN102568475B (en) * | 2011-12-31 | 2014-11-26 | 安徽科大讯飞信息科技股份有限公司 | System and method for assessing proficiency in Putonghua |
CN103366759A (en) * | 2012-03-29 | 2013-10-23 | 北京中传天籁数字技术有限公司 | Speech data evaluation method and speech data evaluation device |
CN102982811A (en) * | 2012-11-24 | 2013-03-20 | 安徽科大讯飞信息科技股份有限公司 | Voice endpoint detection method based on real-time decoding |
CN102982811B (en) * | 2012-11-24 | 2015-01-14 | 安徽科大讯飞信息科技股份有限公司 | Voice endpoint detection method based on real-time decoding |
CN103177733A (en) * | 2013-03-11 | 2013-06-26 | 哈尔滨师范大学 | Method and system for evaluating Chinese mandarin retroflex suffixation pronunciation quality |
CN103177733B (en) * | 2013-03-11 | 2015-09-09 | 哈尔滨师范大学 | Standard Chinese suffixation of a nonsyllabic "r" sound voice quality evaluating method and system |
CN104766607A (en) * | 2015-03-05 | 2015-07-08 | 广州视源电子科技股份有限公司 | Television program recommendation method and system |
CN106803424A (en) * | 2015-11-26 | 2017-06-06 | 北京奥鹏远程教育中心有限公司 | A kind of Chinese proficiency measuring technology |
CN105261246A (en) * | 2015-12-02 | 2016-01-20 | 武汉慧人信息科技有限公司 | Spoken English error correcting system based on big data mining technology |
CN105261246B (en) * | 2015-12-02 | 2018-06-05 | 武汉慧人信息科技有限公司 | A kind of Oral English Practice error correction system based on big data digging technology |
CN105529030A (en) * | 2015-12-29 | 2016-04-27 | 百度在线网络技术(北京)有限公司 | Speech recognition processing method and device |
CN106328123A (en) * | 2016-08-25 | 2017-01-11 | 苏州大学 | Method of recognizing ear speech in normal speech flow under condition of small database |
CN106328123B (en) * | 2016-08-25 | 2020-03-20 | 苏州大学 | Method for recognizing middle ear voice in normal voice stream under condition of small database |
CN106558308A (en) * | 2016-12-02 | 2017-04-05 | 深圳撒哈拉数据科技有限公司 | A kind of internet audio quality of data auto-scoring system and method |
CN106558308B (en) * | 2016-12-02 | 2020-05-15 | 深圳撒哈拉数据科技有限公司 | Internet audio data quality automatic scoring system and method |
CN106847308A (en) * | 2017-02-08 | 2017-06-13 | 西安医学院 | A kind of pronunciation of English QA system |
CN109313892A (en) * | 2017-05-17 | 2019-02-05 | 北京嘀嘀无限科技发展有限公司 | Steady language identification method and system |
CN109313892B (en) * | 2017-05-17 | 2023-02-21 | 北京嘀嘀无限科技发展有限公司 | Robust speech recognition method and system |
CN107767858A (en) * | 2017-09-08 | 2018-03-06 | 科大讯飞股份有限公司 | Pronunciation dictionary generation method and device, storage medium, electronic equipment |
CN107958673A (en) * | 2017-11-28 | 2018-04-24 | 北京先声教育科技有限公司 | A kind of spoken language methods of marking and device |
CN108520749A (en) * | 2018-03-06 | 2018-09-11 | 杭州孚立计算机软件有限公司 | A kind of voice-based grid-based management control method and control device |
CN109859773A (en) * | 2019-02-14 | 2019-06-07 | 北京儒博科技有限公司 | A kind of method for recording of sound, device, storage medium and electronic equipment |
CN110415725A (en) * | 2019-07-15 | 2019-11-05 | 北京语言大学 | Use the method and system of first language data assessment second language pronunciation quality |
CN110390948A (en) * | 2019-07-24 | 2019-10-29 | 厦门快商通科技股份有限公司 | A kind of method and system of Rapid Speech identification |
CN111128181A (en) * | 2019-12-09 | 2020-05-08 | 科大讯飞股份有限公司 | Recitation question evaluation method, device and equipment |
CN111710332A (en) * | 2020-06-30 | 2020-09-25 | 北京达佳互联信息技术有限公司 | Voice processing method and device, electronic equipment and storage medium |
CN112530455A (en) * | 2020-11-24 | 2021-03-19 | 东风汽车集团有限公司 | Automobile door closing sound quality evaluation method and evaluation system based on MFCC |
Also Published As
Publication number | Publication date |
---|---|
CN100411011C (en) | 2008-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1763843A (en) | Pronunciation quality evaluating method for language learning machine | |
CN101661675B (en) | Self-sensing error tone pronunciation learning method and system | |
US9672816B1 (en) | Annotating maps with user-contributed pronunciations | |
Gao et al. | A study on robust detection of pronunciation erroneous tendency based on deep neural network. | |
CN101645271B (en) | Rapid confidence-calculation method in pronunciation quality evaluation system | |
CN103928023A (en) | Voice scoring method and system | |
CN101887725A (en) | Phoneme confusion network-based phoneme posterior probability calculation method | |
CN101246685A (en) | Pronunciation quality evaluation method of computer auxiliary language learning system | |
CN109377981B (en) | Phoneme alignment method and device | |
US11676572B2 (en) | Instantaneous learning in text-to-speech during dialog | |
CN109243460A (en) | A method of automatically generating news or interrogation record based on the local dialect | |
CN110047474A (en) | A kind of English phonetic pronunciation intelligent training system and training method | |
CN1787070A (en) | Chip upper system for language learner | |
Suyanto et al. | End-to-End speech recognition models for a low-resourced Indonesian Language | |
Sinclair et al. | A semi-markov model for speech segmentation with an utterance-break prior | |
Xu | English speech recognition and evaluation of pronunciation quality using deep learning | |
Rasanen | Basic cuts revisited: Temporal segmentation of speech into phone-like units with statistical learning at a pre-linguistic level | |
Mary et al. | Searching speech databases: features, techniques and evaluation measures | |
Liu et al. | AI recognition method of pronunciation errors in oral English speech with the help of big data for personalized learning | |
CN112767961B (en) | Accent correction method based on cloud computing | |
Liu et al. | Deriving disyllabic word variants from a Chinese conversational speech corpus | |
Yang et al. | Landmark-based pronunciation error identification on Chinese learning | |
US8768697B2 (en) | Method for measuring speech characteristics | |
Zheng | An analysis and research on Chinese college students’ psychological barriers in oral English output from a cross-cultural perspective | |
Jin | Design of Students' Spoken English Pronunciation Training System Based on Computer VB Platform. |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
CF01 | Termination of patent right due to non-payment of annual fee | |
Granted publication date: 20080813 Termination date: 20191118 |