CN104732977A - On-line spoken language pronunciation quality evaluation method and system - Google Patents

On-line spoken language pronunciation quality evaluation method and system

Info

Publication number
CN104732977A
Authority
CN
China
Prior art keywords
test speech
speech
pronunciation
feature parameter
score
Prior art date
Legal status
Granted
Application number
CN201510102425.8A
Other languages
Chinese (zh)
Other versions
CN104732977B (en)
Inventor
李心广
李苏梅
徐集优
张胜斌
陈君宇
李升恒
朱小凡
王泽铿
许港帆
陈嘉华
林帆
Current Assignee
Guangdong University of Foreign Studies
Original Assignee
Guangdong University of Foreign Studies
Priority date
Filing date
Publication date
Application filed by Guangdong University of Foreign Studies
Priority to CN201510102425.8A
Publication of CN104732977A
Application granted
Publication of CN104732977B
Legal status: Active


Abstract

The invention discloses an online spoken language pronunciation quality evaluation method and system. The method comprises the following steps: receiving, over a network, test speech collected by a mobile client; preprocessing the received test speech; extracting speech feature parameters from the preprocessed test speech to obtain the feature parameters of the test speech; evaluating the test speech according to its feature parameters and the feature parameters of a standard speech to obtain an evaluation result; and feeding the evaluation result back to the mobile client over the network, where the evaluation result is displayed. The method and system achieve online, convenient, and accurate spoken language pronunciation quality evaluation.

Description

Online spoken language pronunciation quality evaluation method and system
Technical field
The present invention relates to the field of speech recognition and evaluation technology, and in particular to an online spoken language pronunciation quality evaluation method and system.
Background art
The application of signal processing technology to language learning is an important part of integrating information technology with language learning. Its goal is to combine the latest speech technology with current teaching and learning methods to build computer-assisted language learning systems, and spoken pronunciation quality assessment has long attracted attention as a core component of computer-assisted language learning.
However, traditional spoken pronunciation quality assessment systems are mostly confined to dedicated language-learning devices or PCs; they are inconvenient to carry and depend on wired network connections, which prevents learners from practicing anytime and anywhere. Existing spoken pronunciation scoring systems for smartphones run offline, but mainstream smartphones currently cannot support large-volume storage or complex speech computation, which limits the complexity of the pronunciation evaluation algorithms that can run on a mobile device, so their assessment results cannot truly reflect the speech quality of a spoken English learner. Moreover, existing schemes deploy the speech evaluation module on the learning device, PC, or mobile device itself, which hinders data updates, storage, and algorithm improvement. Furthermore, in pronunciation quality evaluation, existing systems consider an incomplete set of evaluation indices, mostly a single index or a small number of them, so they cannot evaluate a user's pronunciation quality scientifically, comprehensively, and accurately; they often merely output a score for the pronunciation, without qualitative evaluation and feedback.
Summary of the invention
The object of the embodiments of the present invention is to provide an online spoken language pronunciation quality evaluation method and system that achieve online, convenient, and accurate spoken pronunciation quality assessment.
In one aspect, an embodiment of the present invention provides an online spoken language pronunciation quality evaluation method, comprising:
receiving, over a network, test speech collected by a mobile client;
preprocessing the received test speech;
extracting speech feature parameters from the preprocessed test speech to obtain the feature parameters of the test speech;
evaluating the test speech according to the feature parameters of the test speech and the feature parameters of a standard speech to obtain an evaluation result;
feeding the evaluation result back to the mobile client over the network, and displaying the evaluation result on the mobile client.
Preferably, the online spoken language pronunciation quality evaluation method further comprises:
storing the evaluation result in a database, and performing statistical analysis on evaluation results to obtain statistics;
sending the statistics to a web management terminal, and displaying the statistics on the web management terminal.
Preferably, the online spoken language pronunciation quality evaluation method further comprises:
obtaining the standard speech;
preprocessing the standard speech;
extracting speech feature parameters from the preprocessed standard speech to obtain the feature parameters of the standard speech.
Preferably, the preprocessing comprises pre-emphasis, framing, windowing, and endpoint detection.
Preferably, extracting speech feature parameters from the preprocessed test speech to obtain the feature parameters of the test speech comprises:
performing a discrete Fourier transform on the test speech to obtain its spectral coefficients, filtering the spectral coefficient sequence with a bank of triangular filters, taking the logarithm of the filtered data, and applying a discrete cosine transform to obtain the MFCC feature parameters of the test speech;
extracting the fundamental frequency features, short-time energy features, and formant features of the test speech, and combining the fundamental frequency features, short-time energy features, and formant features into the emotion feature parameters of the test speech;
calculating the pronunciation duration of the test speech to obtain the pronunciation duration feature parameter of the test speech;
dividing the test speech into stress units and extracting the start-frame position set and end-frame position set of the stresses to obtain the stress position feature parameters of the test speech;
dividing the test speech into speech units and calculating the duration of each speech unit to obtain the speech unit duration feature parameters of the test speech;
extracting the pitch of each frame of the test speech with the time-domain autocorrelation function method to obtain the pitch feature parameters of the test speech.
Preferably, evaluating the test speech according to the feature parameters of the test speech and the feature parameters of the standard speech to obtain an evaluation result comprises:
performing speech recognition on the test speech with a probabilistic neural network ensemble recognition model based on segment clustering, using the MFCC feature parameters of the test speech, to obtain a speech recognition result; computing the similarity between the MFCC feature parameters of the test speech and the MFCC feature parameters of the standard speech to obtain an MFCC correlation coefficient; and calculating the accuracy score of the test speech from the speech recognition result and the MFCC correlation coefficient;
performing emotion recognition on the test speech with an SVM emotion model, using the emotion feature parameters of the test speech, to obtain an emotion recognition result; computing the similarity between the emotion feature parameters of the test speech and the emotion feature parameters of the standard speech to obtain an emotion correlation coefficient; and calculating the emotion score of the test speech from the emotion recognition result and the emotion correlation coefficient;
obtaining the speech rate ratio of the standard speech to the test speech from the pronunciation duration feature parameters of the standard speech and the test speech, and calculating the speech rate score of the test speech from the speech rate ratio;
comparing the stress positions of the test speech and the standard speech according to the stress position feature parameters of the test speech and the standard speech, and calculating the stress score of the test speech from the stress position difference;
obtaining the dPVI parameter of the test speech from the speech unit duration feature parameters of the test speech and the standard speech using the dPVI algorithm, and calculating the rhythm score of the test speech from the dPVI parameter;
obtaining the pitch difference between the standard speech and the test speech from the pitch feature parameters of the test speech and the standard speech using the DTW algorithm, and calculating the intonation score of the test speech from the pitch difference.
Preferably, evaluating the test speech according to the feature parameters of the test speech and the feature parameters of the standard speech to obtain an evaluation result further comprises:
computing a weighted sum of the accuracy score, the emotion score, the speech rate score, the stress score, the rhythm score, and the intonation score to obtain an overall score; according to the accuracy score, the emotion score, the speech rate score, the stress score, the rhythm score, the intonation score, and the overall score, combined with the mapping between each score and its grade, obtaining the accuracy grade, emotion grade, speech rate grade, stress grade, rhythm grade, intonation grade, and overall grade of the test speech; and taking the accuracy grade, emotion grade, speech rate grade, stress grade, rhythm grade, intonation grade, and overall grade of the test speech as the evaluation result of the test speech.
Preferably, the online spoken language pronunciation quality evaluation method further comprises:
providing guidance for the user's spoken pronunciation according to the evaluation result, obtaining pronunciation guidance;
feeding the pronunciation guidance back to the mobile client over the network, and displaying the pronunciation guidance on the mobile client.
In another aspect, an embodiment of the present invention provides an online spoken language pronunciation quality assessment system comprising a mobile client and a server connected by a network;
the mobile client comprises:
a speech acquisition unit for collecting test speech and sending the test speech to the server over the network;
the server comprises:
a preprocessing unit for preprocessing the received test speech;
a feature parameter extraction unit for extracting speech feature parameters from the preprocessed test speech to obtain the feature parameters of the test speech;
a speech evaluation unit for evaluating the test speech according to the feature parameters of the test speech and the feature parameters of the standard speech to obtain an evaluation result, and for feeding the evaluation result back to the mobile client over the network;
the mobile client further comprises:
a data display unit for displaying the evaluation result.
Preferably, the system further comprises a web management terminal connected to the server by a network; the server further comprises a database and a statistical analysis unit;
the database is configured to store the evaluation results;
the statistical analysis unit is configured to perform statistical analysis on the evaluation results to obtain statistics, and to send the statistics to the web management terminal;
the web management terminal is configured to display the received statistics.
Compared with the prior art, the embodiments of the present invention have the following advantages:
The embodiments of the present invention build a mobile client and a server on a C/S (Client/Server) architecture: the mobile client collects the user's test speech signal and sends it to the server, the server evaluates the test speech and returns the evaluation result to the mobile client, and the mobile client displays the evaluation result. Users can conveniently access the server over the mobile Internet to obtain services and data; the corpus and evaluation methods can all be kept synchronized by the server, and the server can provide speech analysis processing with better performance and better results.
Secondly, the embodiments of the present invention also build a web management terminal and a server on a B/S (Browser/Server) architecture, so that the spoken pronunciation quality assessment statistics of mobile client users can be obtained in real time from the server's database through a web browser, giving third parties (such as instructors) a picture of the spoken pronunciation of mobile client users and helping them formulate offline spoken-language guidance and improvement strategies.
Further, the embodiments of the present invention evaluate the test speech along multiple dimensions, with a reasonable and credible evaluation method for each index, and feed pronunciation guidance back for the user's spoken pronunciation, which helps correct the user's mispronunciations and improve speech quality.
Brief description of the drawings
Fig. 1 is a flow chart of an embodiment of the online spoken language pronunciation quality evaluation method provided by the present invention;
Fig. 2 is a schematic diagram of the construction of the probabilistic neural network ensemble classifier provided by the present invention;
Fig. 3 is a C/S architecture diagram of an embodiment of the online spoken language pronunciation quality assessment system provided by the present invention;
Fig. 4 is a B/S architecture diagram of the online spoken language pronunciation quality assessment system shown in Fig. 3.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Note that the labels preceding the steps in the embodiments serve only to identify the steps clearly; they do not impose a mandatory order on the steps.
Referring to Fig. 1, a flow chart of an embodiment of the online spoken language pronunciation quality evaluation method provided by the present invention, the method comprises:
S1: receiving, over a network, test speech collected by a mobile client.
In a concrete implementation, the mobile client is installed as an application on the user's mobile phone or other mobile device. By invoking the recording program of the mobile device, it collects the speech the user produces in a spoken test and generates an audio file in a unified format; the mobile client compresses and encodes the audio file and then sends it to the server over the network. The audio file is preferably in WAV format, the network is preferably the mobile Internet, and the mobile client and server transmit data over sockets based on the TCP/IP (Transmission Control Protocol/Internet Protocol) communication protocol.
S2: preprocessing the received test speech.
After receiving the data sent by the mobile client, the server decompresses and decodes the received data to recover the original test speech file. Before the test speech is processed and analyzed, it is preprocessed to eliminate the influences on the test speech of the speaker's own vocal organs and of the recording equipment; this provides a high-quality data source for the subsequent extraction of speech feature parameters and thus improves the quality of speech processing. The preprocessing in this embodiment includes, but is not limited to, pre-emphasis, framing, windowing, and endpoint detection, as follows:
2.1) Pre-emphasis: the average power spectrum of the test speech is affected by glottal excitation and lip-and-nose radiation, and above roughly 800 Hz it rolls off at about 6 dB/oct, so the higher the frequency, the smaller the corresponding component. The high-frequency part of the test speech therefore needs to be boosted before the test speech is analyzed. This embodiment applies a 6 dB/oct high-frequency pre-emphasis digital filter to the test speech before analysis, boosting its high-frequency part so that the spectrum of the test speech becomes flat and retains the same characteristics from the low-frequency to the high-frequency band. The pre-emphasis formula is as follows:
y(n) = x(n) − 0.9375·x(n−1)   (Formula 1)
where x(n) is the original test speech.
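By way of illustration, Formula 1 can be sketched as the following minimal filter; the function name and the use of NumPy are illustrative assumptions rather than part of the claimed method:

```python
import numpy as np

def pre_emphasis(x, alpha: float = 0.9375) -> np.ndarray:
    """Boost high frequencies: y(n) = x(n) - alpha * x(n-1) (Formula 1)."""
    x = np.asarray(x, dtype=float)
    y = np.empty_like(x)
    y[0] = x[0]                      # the first sample has no predecessor
    y[1:] = x[1:] - alpha * x[:-1]
    return y
```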
2.2) Framing: a speech signal is time-varying, but within a short range its characteristics remain essentially unchanged, i.e., relatively stable; this property of the speech signal is called the "short-time property", and the short range is generally 10–30 ms. The processing and analysis of the test speech is therefore built on the short-time property, and the test speech undergoes "short-time analysis", i.e., framing. Because adjacent segments of a speech signal are correlated, this embodiment frames the test speech with overlapping frames.
2.3) Windowing: to emphasize the speech waveform near each sample position in the test speech and attenuate the rest of the waveform, this embodiment applies a Hamming window to the test speech. Windowing after framing reduces the Gibbs phenomenon caused by truncation and makes the spectrum of the test speech smoother. In one realizable form, the windowing formula is as follows:
s_w(n) = y(n)·ω(n)   (Formula 2)
where y(n) is the pre-emphasized speech signal and ω(n) is the window function.
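By way of illustration, overlapping framing followed by Hamming windowing (Formula 2) can be sketched as follows; the frame length and hop size correspond to 25 ms frames with a 10 ms step at a 16 kHz sampling rate, which is an assumed configuration, not one fixed by the patent:

```python
import numpy as np

def frame_and_window(y: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split the pre-emphasized signal into overlapping frames and apply a
    Hamming window to each: s_w(n) = y(n) * w(n) (Formula 2)."""
    n_frames = 1 + max(0, (len(y) - frame_len) // hop)
    window = np.hamming(frame_len)
    frames = np.stack([y[i * hop : i * hop + frame_len] for i in range(n_frames)])
    return frames * window
```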
2.4) Endpoint detection: this embodiment uses the dual-threshold method to detect the start point and end point of the test speech. The dual-threshold method takes the short-time energy E and the short-time average zero-crossing rate Z as features; combining the advantages of both makes detection more accurate, effectively shortens the system's processing time, improves its real-time performance, and excludes the noise of silent segments, thereby improving the handling of the speech signal.
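By way of illustration, a simplified single-pass version of the dual-threshold idea is sketched below; a frame is treated as speech when either feature exceeds its threshold, and the thresholds themselves are assumed to have been tuned on silence beforehand:

```python
import numpy as np

def detect_endpoints(frames: np.ndarray, e_thresh: float, z_thresh: float):
    """Mark frames whose short-time energy or zero-crossing rate exceeds
    its threshold, and return the first and last active frame indices."""
    energy = (frames ** 2).sum(axis=1)                                  # short-time energy E
    zcr = (np.abs(np.diff(np.sign(frames), axis=1)) > 0).mean(axis=1)   # zero-crossing rate Z
    active = (energy > e_thresh) | (zcr > z_thresh)
    idx = np.flatnonzero(active)
    if idx.size == 0:
        return None                                                     # no speech detected
    return int(idx[0]), int(idx[-1])                                    # start, end frames
```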
S3: extracting speech feature parameters from the preprocessed test speech to obtain the feature parameters of the test speech. The feature parameters of the test speech comprise MFCC (Mel-Frequency Cepstral Coefficients) feature parameters, emotion feature parameters, pronunciation duration feature parameters, stress position feature parameters, speech unit duration feature parameters, and pitch feature parameters. The feature parameter extraction carried out on the server proceeds as follows:
3.1) A discrete Fourier transform (DFT) is applied to the test speech to obtain its spectral coefficients; the spectral coefficient sequence is filtered with a bank of triangular filters; the logarithm of the filtered data is taken; and a discrete cosine transform yields the MFCC feature parameters of the test speech. The concrete steps are as follows:
Apply the discrete Fourier transform to the preprocessed test speech to obtain the spectral coefficients X(k).
Filter the spectral coefficient sequence X(k) with the triangular filter bank to obtain a set of coefficients m_i, computed as:
m_i = ln[ Σ_k X(k)·H_i(k) ]   (Formula 3)
where H_i(k) is the transfer function of the i-th triangular filter:
H_i(k) = (k − f[i−1]) / (f[i] − f[i−1]) for f[i−1] ≤ k ≤ f[i]; H_i(k) = (f[i+1] − k) / (f[i+1] − f[i]) for f[i] < k ≤ f[i+1]; H_i(k) = 0 otherwise   (Formula 4)
and f[i] is the centre frequency of the i-th triangular filter, satisfying:
Mel(f[i+1]) − Mel(f[i]) = Mel(f[i]) − Mel(f[i−1])   (Formula 5)
Take the logarithm of the outputs of all the filters, then apply the discrete cosine transform to obtain the cepstral coefficients:
C_i = √(2/P) · Σ_{j=1…P} log(m_j) · cos[ π·i·(j − 0.5)/P ]   (Formula 6)
where P is the number of triangular filters and C_i are the required MFCC feature parameters. Preferably, the order of the MFCC feature parameters is set to 12.
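By way of illustration, the whole MFCC chain of Formulas 3–6 can be sketched as follows; the sampling rate and the number of filters are assumed values (the patent fixes only the 12th-order output), and the small constant added before the logarithm is a numerical safeguard:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(frames: np.ndarray, sr: int = 16000, n_filters: int = 24, n_ceps: int = 12):
    """DFT -> triangular mel filter bank -> log -> DCT (Formulas 3-6)."""
    n_fft = frames.shape[1]
    spec = np.abs(np.fft.rfft(frames, n_fft)) ** 2               # spectral coefficients X(k)
    # Centre frequencies f[i], equally spaced on the mel scale (Formula 5)
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):                            # H_i(k), Formula 4
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fbank[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fbank[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    m = np.log(spec @ fbank.T + 1e-10)                           # m_i, Formula 3
    P = n_filters                                                # DCT, Formula 6
    i = np.arange(1, n_ceps + 1)
    j = np.arange(1, P + 1)
    dct = np.sqrt(2.0 / P) * np.cos(np.pi * np.outer(i, j - 0.5) / P)
    return m @ dct.T                                             # one 12-dim vector per frame
```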
3.2) The fundamental frequency features, short-time energy features, and formant features of the test speech are extracted, and together they form the emotion feature parameters of the test speech.
3.2.1) Fundamental frequency features: the pitch period is the periodicity of voiced sounds caused by vocal cord vibration, and the fundamental frequency is the reciprocal of the pitch period. The fundamental frequency is one of the most important parameters of a speech signal, and research shows that it reflects changes of emotion. Detection methods for the fundamental frequency include, but are not limited to, the autocorrelation function (ACF) method, cepstral analysis, the average magnitude difference function (AMDF) method, and wavelet methods. This embodiment preferably uses cepstral analysis: a Fourier transform is applied to the preprocessed test speech to obtain its amplitude spectrum; taking the logarithm of the amplitude spectrum yields a periodic signal of the test speech in the frequency domain, and computing the frequency of that periodic signal gives the fundamental frequency of the test speech. Applying an inverse Fourier transform to the periodic signal produces a peak at the pitch period. After the fundamental frequency values have been obtained, seven statistical parameters of the fundamental frequency, such as the maximum, minimum, mean, median, and standard deviation, are computed as the fundamental frequency features of the test speech.
3.2.2) Short-time energy features: the energy of a speech signal is highly correlated with the expression of emotion in speech; high energy indicates a relatively large volume and loudness. In real life, people speak louder when they are angry or indignant, and more quietly when they are depressed or sad. Speech signal energy is usually measured as short-time energy or short-time average magnitude; here the short-time energy of the test speech is preferably chosen as the energy parameter. The short-time energy is the weighted sum of squares of the sample values in one frame, defined as follows:
E_n = Σ_{m=0…N−1} x_n²(m)   (Formula 7)
where x_n(m) is the n-th frame of the test speech signal.
After the short-time energy has been obtained, seven statistical parameters of the short-time energy, such as the maximum, minimum, mean, median, and standard deviation, are computed as the short-time energy features of the test speech.
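By way of illustration, Formula 7 and the summary statistics can be sketched as follows; the patent names only five of the seven statistics, so only those five are computed here:

```python
import numpy as np

def short_time_energy_features(frames: np.ndarray) -> dict:
    """E_n = sum_m x_n(m)^2 per frame (Formula 7), then summary statistics."""
    energy = (frames ** 2).sum(axis=1)
    return {
        "max": float(energy.max()),
        "min": float(energy.min()),
        "mean": float(energy.mean()),
        "median": float(np.median(energy)),
        "std": float(energy.std()),
    }
```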
3.2.3) Formant features: the formants are important parameters reflecting the characteristics of the vocal tract; formant frequencies are produced when the sound excitation passes through the vocal tract. When a person is in different emotional states, the tension of the nerves differs, deforming the vocal tract, and the formant frequencies change accordingly. This embodiment preferably extracts the formant parameters of each speech frame by linear prediction, which extracts formant parameters quickly and effectively: the first and second formants of the speech signal are obtained by linear prediction and then normalized by the segment clustering method into 32nd-order parameters, which serve as the formant features of the test speech. The formant features, fundamental frequency features, and short-time energy features are combined to form the 46th-order speech emotion feature parameters.
3.3) The pronunciation duration of the test speech is calculated to obtain the pronunciation duration feature parameter of the test speech.
In a concrete implementation, endpoint detection can be performed on the test speech by setting upper and lower thresholds on the short-time energy and the zero-crossing rate, which yields the pronunciation duration of the test speech.
3.4) The test speech is divided into stress units, and the start-frame position set and end-frame position set of the stresses are extracted, giving the stress position feature parameters of the test speech.
The stress unit division proceeds as follows:
a. Extract the energy values of the test speech. The loudness of stressed syllables in the test speech is reflected in the energy intensity in the time domain: a stressed syllable shows high speech energy intensity.
b. Normalize the test speech. Because speakers differ in speech rate, the pronunciation duration of the same sentence differs somewhat between speakers; nevertheless, across speakers the durations of the stress units tend to occupy a fixed proportion of the whole sentence. Therefore, when scoring the test speech, the pronunciation duration feature parameter of the standard speech can be retrieved and the pronunciation duration of the test speech scaled proportionally to equal that of the standard speech; this facilitates the processing of the data and also makes the system's evaluation more objective.
c. Extract the syllables of the test speech. In a concrete implementation, the dual-threshold method can be used for stress endpoint detection: based on the energy values of the test speech, the maximum signal value S_max exceeding the stress threshold T_u is located one instance at a time; the signal is then searched to the left and right of S_max for the values S_L and S_R that equal the non-stress threshold T_L; S_L and S_R are set as the stress boundaries of the test speech, and the signal between S_L and S_R is zeroed to avoid repeating the search between S_L and S_R. Stressed syllables in the test speech tend to be relatively long in duration, whereas the stressed-syllable units found in the first step may have large energy values, i.e., sound loud, but very short durations; such units may be short vowels or interference from signal spikes and do not constitute stressed syllables. The candidate units are therefore screened further by the relatively long duration of stressed syllables: the minimum duration of a stressed-syllable unit is set to roughly one stressed vowel duration, preferably 100 ms, and candidates are compared against this set minimum.
Through the above steps the division of the sentence into stress units is completed, the start-frame position set and end-frame position set of the sentence stresses are known, and these position sets are taken as the stress position feature parameters of the test speech.
3.5) The test speech is divided into speech units, and the duration of each speech unit is calculated, giving the speech unit duration feature parameters of the test speech. The duration of a speech unit is the time from the start of the unit to its end.
3.6) The pitch of each frame of the test speech is extracted with the time-domain autocorrelation function (ACF) method, giving the pitch feature parameters of the test speech.
The ACF method uses the autocorrelation function to compute the similarity of a sound frame s(i), i = 0…n−1, with itself; the formula is as follows:
acf(τ) = Σ_{i=0…n−1−τ} s(i)·s(i+τ)   (Formula 8)
where n is the length of one frame of speech data and τ is the time lag. Finding the τ that maximizes acf(τ) within a reasonable given interval yields the pitch of the sound frame. In the concrete ACF computation, the speech frame is shifted to the right one sample at a time and the inner product of the overlapping part of the shifted frame with the original frame is taken; the n inner products obtained after n repetitions are the ACF values of the speech frame.
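By way of illustration, per-frame pitch extraction via Formula 8 can be sketched as follows; the search interval is bounded by an assumed plausible pitch range of 60–400 Hz:

```python
import numpy as np

def frame_pitch(frame: np.ndarray, sr: int = 16000,
                fmin: float = 60.0, fmax: float = 400.0) -> float:
    """Pitch of one frame: acf(tau) = sum_i s(i) s(i+tau) (Formula 8),
    maximized over a reasonable lag interval."""
    n = len(frame)
    acf = np.correlate(frame, frame, mode="full")[n - 1:]   # acf(0) .. acf(n-1)
    lo, hi = int(sr / fmax), int(sr / fmin)                 # lag range for the pitch
    tau = lo + int(np.argmax(acf[lo:hi]))
    return sr / tau                                         # pitch in Hz
```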
S4: evaluating the test speech according to the feature parameters of the test speech and the feature parameters of the standard speech to obtain an evaluation result.
Note that the feature parameters of the standard speech are obtained in advance by performing speech feature extraction on the standard speech; they are stored in the database and retrieved when needed. The concrete steps for extracting the feature parameters of the standard speech comprise: obtaining the standard speech; preprocessing the standard speech; and extracting speech feature parameters from the preprocessed standard speech to obtain its feature parameters. These steps are identical to the feature extraction process for the test speech and are not repeated here.
The process of evaluating the test speech according to the feature parameters of the test speech and the feature parameters of the standard speech is as follows:
4.1) Speech recognition is performed on the test speech with the probabilistic neural network (PNN) ensemble recognition model based on segment clustering, using the MFCC feature parameters of the test speech, and a speech recognition result is obtained. The similarity between the MFCC feature parameters of the test speech and those of the standard speech is computed, giving an MFCC correlation coefficient. The accuracy score of the test speech is then calculated from the speech recognition result and the MFCC correlation coefficient. Note that the PNN ensemble recognition model based on segment clustering is trained in advance, stored in the database, and retrieved when needed.
In this embodiment, the Bagging (Bootstrap aggregating) approach is used to generate the individual probabilistic neural network models required for the ensemble. Bagging is an ensemble learning method that integrates multiple different individual learners into one learner; repeated sampling yields different data subsets, so the individual learners trained on the different subsets have high generalization ability and large diversity. Distributed computation over the existing network can further improve the time efficiency of the algorithm, and Bagging improves the performance of the learners, which helps raise the classification accuracy and generalization ability of the probabilistic neural network.
Referring to Fig. 2, the schematic diagram of the construction of the probabilistic neural network ensemble classifier provided by the present invention: each time, n samples are drawn at random from the training sample set A (Bagging sample A1, Bagging sample A2, …, Bagging sample An in the figure) and trained with the probabilistic neural network classification algorithm to obtain one PNN classifier; the same method generates multiple PNN classifiers (PNN classifiers C_1(x), C_2(x), …, C_n(x) in the figure). Training thus yields a sequence of classification functions C_1(x), C_2(x), …, C_n(x), i.e., the PNN ensemble classifier, which is the PNN ensemble recognition model of this embodiment. The final classification function C(x) decides classification problems by voting: the classification result with the most votes is the final class output by C(x).
During speech recognition, the MFCC feature parameters of the test speech simply need to be input into the PNN ensemble recognition model, which classifies by voting and judges whether the content is correct. At the same time, the similarity between the MFCC feature parameters of the test speech and those of the standard speech is computed, and finally the accuracy of the test speech is scored according to whether the content is correct and the magnitude of the MFCC correlation coefficient.
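By way of illustration, a bagged PNN ensemble with majority voting can be sketched as follows; the Gaussian kernel width, the number of models, and the assumption that class labels are small non-negative integers are all illustrative choices, not requirements of the patent:

```python
import numpy as np

class PNN:
    """Minimal probabilistic neural network: one Gaussian kernel per training
    pattern; the class with the largest summed kernel response wins."""
    def __init__(self, sigma: float = 1.0):
        self.sigma = sigma
    def fit(self, X: np.ndarray, y: np.ndarray) -> "PNN":
        self.X, self.y = X, y
        self.classes = np.unique(y)
        return self
    def predict(self, X: np.ndarray) -> np.ndarray:
        d2 = ((X[:, None, :] - self.X[None, :, :]) ** 2).sum(-1)
        k = np.exp(-d2 / (2.0 * self.sigma ** 2))          # pattern-layer outputs
        scores = np.stack([k[:, self.y == c].sum(1) for c in self.classes], 1)
        return self.classes[scores.argmax(1)]

def bagging_pnn_predict(X_train: np.ndarray, y_train: np.ndarray,
                        X_test: np.ndarray, n_models: int = 10, seed: int = 0) -> np.ndarray:
    """Train each PNN on a bootstrap resample, then take a majority vote."""
    rng = np.random.default_rng(seed)
    votes = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X_train), len(X_train))  # bootstrap sample
        votes.append(PNN().fit(X_train[idx], y_train[idx]).predict(X_test))
    votes = np.stack(votes)                                # (n_models, n_test)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```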
4.2) Emotion recognition is performed on the test speech with an SVM (Support Vector Machine) emotion model, using the emotion feature parameters of the test speech, and an emotion recognition result is obtained. The similarity between the emotion feature parameters of the test speech and those of the standard speech is computed, giving an emotion correlation coefficient. The emotion score of the test speech is calculated from the emotion recognition result and the emotion correlation coefficient.
After the emotion feature parameters of the test speech have been extracted, they are input into the SVM emotion model for classification, and at the same time the correlation coefficient between the emotion feature parameters of the test speech and those of the standard speech is computed. Finally, the emotion score is derived from whether the emotion classification result is correct and the magnitude of the correlation coefficient of the emotion feature parameters.
4.3) The speech rate ratio of the standard speech to the test speech is obtained from the pronunciation duration feature parameters of the standard speech and the test speech, and the speech rate score of the test speech is calculated from the ratio.
After the pronunciation duration feature parameter of the test speech has been extracted, the speech rate ratio is computed by the following formula:
ratio = S_duration / T_duration   (Formula 9)
where S_duration is the pronunciation duration of the standard speech and T_duration is the pronunciation duration of the test speech.
A speech rate that is too fast or too slow fails to meet the requirements of linguistic expression, so the speech rate of the test speech can be scored according to the speech rate ratio, by the degree to which the rate is too fast or too slow.
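By way of illustration, a scoring rule of this kind can be sketched as follows; the patent does not fix the penalty function, so the symmetric rule 100·min(r, 1/r) is purely an assumed example:

```python
def speech_rate_score(std_duration: float, test_duration: float) -> float:
    """Formula 9 plus an assumed penalty: full marks when the test speech
    matches the standard duration, decreasing as the rate deviates."""
    ratio = std_duration / test_duration          # Formula 9
    return 100.0 * min(ratio, 1.0 / ratio)
```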
4.4) The stress positions of the test speech and the standard speech are compared according to their stress position feature parameters, and the stress score of the test speech is calculated from the stress position difference.
The stress position feature parameters give the start-frame and end-frame position sets of the stresses; the stress distribution difference diff between the test speech and the standard speech is computed by the following formula:
diff = Σ_{i=1…n} { ( left_std[i]/Len_std − left_test[i]/Len_test ) + ( right_std[i]/Len_std − right_test[i]/Len_test ) }   (Formula 10)
where Len_std is the effective speech frame length of the standard speech and Len_test is the effective speech frame length of the test speech; left_std[i] and right_std[i] are the start-frame and end-frame position sets of the standard speech, and left_test[i] and right_test[i] are the start-frame and end-frame position sets of the test speech.
The stress of the test speech is scored according to the magnitude of the stress position difference between the test speech and the standard speech.
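By way of illustration, Formula 10 can be sketched as follows; the sketch assumes both utterances yielded the same number of stress units, a case the patent does not elaborate:

```python
import numpy as np

def stress_difference(left_std, right_std, len_std,
                      left_test, right_test, len_test) -> float:
    """diff over normalized stress start/end frame positions (Formula 10)."""
    ls = np.asarray(left_std, dtype=float) / len_std
    rs = np.asarray(right_std, dtype=float) / len_std
    lt = np.asarray(left_test, dtype=float) / len_test
    rt = np.asarray(right_test, dtype=float) / len_test
    return float(np.sum((ls - lt) + (rs - rt)))
```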
4.5) The dPVI parameter of the test speech is obtained from the speech unit duration feature parameters of the test speech and the standard speech using the dPVI (Distinct Pairwise Variability Index) algorithm, and the rhythm score of the test speech is calculated from the dPVI parameter.
After the speech unit duration feature parameters of the test speech have been extracted, they are compared with the speech unit duration feature parameters of the standard speech, and the comparison is converted into the dPVI parameter on which the system score is based. The dPVI parameter is computed by the following formula:
dPVI = 100 × ( Σ_{k=1…m−1} |d_{1,k} − d_{2,k}| + |d_{1,t} − d_{2,t}| ) / Len   (Formula 11)
where d is a speech unit duration from the sentence division (e.g., d_k is the duration of the k-th speech unit), the subscripts 1 and 2 denote the standard speech and the test speech respectively, m = min(S_num, T_num), with S_num the number of speech units of the standard speech and T_num the number of speech units of the test speech, and Len is the duration of the standard speech.
The rhythm score of the test speech is calculated according to the magnitude of the dPVI parameter.
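By way of illustration, Formula 11 can be sketched as follows; reading the final |d_{1,t} − d_{2,t}| term as the pair of last speech units is an interpretation of the formula, not something the patent states explicitly:

```python
import numpy as np

def dpvi(std_durations, test_durations, std_len: float) -> float:
    """Distinct Pairwise Variability Index between two unit-duration
    sequences, normalized by the standard speech duration (Formula 11)."""
    d1 = np.asarray(std_durations, dtype=float)
    d2 = np.asarray(test_durations, dtype=float)
    m = min(len(d1), len(d2))                       # m = min(S_num, T_num)
    body = np.abs(d1[:m - 1] - d2[:m - 1]).sum()    # k = 1 .. m-1
    tail = abs(d1[-1] - d2[-1])                     # assumed |d_1t - d_2t| term
    return 100.0 * (body + tail) / std_len
```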
4.6) The pitch difference between the standard speech and the test speech is obtained from the pitch feature parameters of the test speech and the standard speech using the DTW (Dynamic Time Warping) algorithm, and the intonation score of the test speech is calculated from the pitch difference.
After the pitch feature parameters of the test speech have been extracted, a median filter can additionally be applied to smooth the pitch and exclude unstable speech frames with anomalous pitch values. The DTW algorithm performs a difference comparison between the pitch feature parameters of the test speech and those of the standard speech and computes the pitch difference parameter dist between them; the intonation score of the test speech is then calculated by the following formula:
S_intonation = 100 / (1 + a·dist^b)   (Formula 12)
where, through simulation experiments comparing expert scoring data with the system's scoring data, the constants are determined as a = 0.0005 and b = 2.
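By way of illustration, the DTW comparison and Formula 12 can be sketched as follows; the absolute difference as the local cost is an assumed choice:

```python
import numpy as np

def dtw_distance(p: np.ndarray, q: np.ndarray) -> float:
    """Classic dynamic-time-warping distance between two pitch contours."""
    n, m = len(p), len(q)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(p[i - 1] - q[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return float(D[n, m])

def intonation_score(pitch_test, pitch_std, a: float = 0.0005, b: float = 2.0) -> float:
    """S_intonation = 100 / (1 + a * dist^b), the patent's Formula 12."""
    dist = dtw_distance(np.asarray(pitch_test, float), np.asarray(pitch_std, float))
    return 100.0 / (1.0 + a * dist ** b)
```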
4.7) A weighted sum of the accuracy score, emotion score, speech rate score, stress score, rhythm score, and intonation score is computed to obtain the overall score. According to the accuracy score, emotion score, speech rate score, stress score, rhythm score, intonation score, and overall score, combined with the mapping between each score and its grade, the accuracy grade, emotion grade, speech rate grade, stress grade, rhythm grade, intonation grade, and overall grade of the test speech are obtained, and these grades are taken together as the evaluation result of the test speech.
In computing the weighted sum of the accuracy score, emotion score, speech rate score, stress score, rhythm score, and intonation score, the weight of each index score can take different values according to different requirements, and a weight combination suited to the user's needs can be selected according to the user's own characteristics. The grade of each index and the overall grade are obtained from the mapping between scores and grades. For example, if the accuracy score falls in the range 90–100, the accuracy grade is A; in the range 70–90, grade B; in the range 60–70, grade C; and in the range 0–60, grade D. The mappings between the other scores and their grades are similar to the above mapping between the accuracy score and the accuracy grade and are not repeated here. Note that the above mapping between scores and grades is only an example of the present invention; in practice, different thresholds can be set as needed, different score ranges can be mapped to different grades, and naturally more grades can be defined.
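By way of illustration, the weighted sum and the example grade mapping can be sketched as follows; the equal default weights are an assumption, since the patent deliberately leaves the weight combination configurable:

```python
def evaluate(scores: dict, weights: dict = None) -> dict:
    """Weighted overall score plus the example score-to-grade mapping
    (A: 90-100, B: 70-90, C: 60-70, D: 0-60)."""
    if weights is None:
        weights = {k: 1.0 / len(scores) for k in scores}   # assumed equal weights
    overall = sum(scores[k] * weights[k] for k in scores)

    def grade(s: float) -> str:
        return "A" if s >= 90 else "B" if s >= 70 else "C" if s >= 60 else "D"

    result = {k: (v, grade(v)) for k, v in scores.items()}
    result["overall"] = (overall, grade(overall))
    return result
```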
S5: feeding the evaluation result back to the mobile client over the network, and displaying the evaluation result on the mobile client.
After the server has obtained the evaluation result, it feeds the result back to the mobile client over the mobile Internet, and the mobile client displays the evaluation result information on the screen of the mobile device or announces it audibly.
In a concrete implementation, after obtaining the evaluation result of the test speech, the server can also provide guidance for the user's spoken pronunciation according to the evaluation result, obtaining pronunciation guidance; the evaluation result can be matched against the pronunciation guidance stored in the database.
The pronunciation guidance is fed back to the mobile client over the network and displayed by the mobile client. The pronunciation guidance points out the errors and shortcomings in the user's spoken pronunciation and proposes improvements; for example, if the user's speech rate is detected to be too fast and the rhythm confused, the user can be prompted to slow down slightly and to take hold of the sentence rhythm.
The embodiments of the present invention build a mobile client and a server on a C/S (Client/Server) architecture: the mobile client collects the user's test speech signal and sends it to the server, the server evaluates the test speech and returns the evaluation result to the mobile client, and the mobile client displays the evaluation result. Users can conveniently access the server over the mobile Internet to obtain services and data; the corpus and evaluation methods can all be kept synchronized by the server, and the server can provide speech analysis processing with better performance and better results.
Further, the online spoken language pronunciation quality evaluation method also comprises:
S6: storing the evaluation result in the database, and performing statistical analysis on the evaluation results to obtain statistics.
In a concrete implementation, when a user's test is complete, the user's profile, test speech, and evaluation result can be stored in the database. The server performs statistical analysis on the evaluation results in the database (including each index score and the overall score) to obtain a learning analysis for an individual user; it can also obtain a group learning analysis for a specific user group, or network-wide learning statistics covering all users.
S7: sending the statistics to the web management terminal, and displaying the statistics on the web management terminal. The web management terminal receives the server's statistics on the spoken pronunciation evaluation of mobile client users and presents them to third parties (such as instructors) in visual form.
The embodiments of the present invention also build a web management terminal and a server on a B/S (Browser/Server) architecture, so that the spoken pronunciation quality assessment statistics of mobile client users can be obtained in real time from the server's database through a web browser, giving third parties a picture of the spoken pronunciation of mobile client users and helping them formulate offline spoken-language guidance and improvement strategies.
Referring to Fig. 3, the C/S architecture diagram of an embodiment of the online spoken language pronunciation quality assessment system provided by the present invention. The basic principle of the system is consistent with that of the online spoken language pronunciation quality evaluation method of the embodiment shown in Fig. 1; for details not elaborated in this embodiment, see the related description of the embodiment shown in Fig. 1.
The system comprises a mobile client 100 and a server 200 connected by a network.
The mobile client 100 comprises:
a speech acquisition unit 101 for collecting test speech and sending the test speech to the server 200 over the network.
The server 200 comprises:
a preprocessing unit 201 for preprocessing the received test speech;
a feature parameter extraction unit 202 for extracting speech feature parameters from the preprocessed test speech to obtain the feature parameters of the test speech;
a speech evaluation unit 203 for evaluating the test speech according to the feature parameters of the test speech and the feature parameters of the standard speech to obtain an evaluation result, and for feeding the evaluation result back to the mobile client 100 over the network.
The mobile client 100 further comprises:
a data display unit 102 for displaying the evaluation result.
Referring to Fig. 4, the B/S architecture diagram of the online spoken language pronunciation quality assessment system shown in Fig. 3.
The system further comprises a web management terminal 300 connected to the server 200 by a network. The server 200 further comprises a database 204 and a statistical analysis unit 205.
The database 204 stores the evaluation results.
The statistical analysis unit 205 performs statistical analysis on the evaluation results to obtain statistics and sends the statistics to the web management terminal 300.
The web management terminal 300 displays the received statistics.
The embodiments of the present invention build a mobile client 100, a server 200, and a web management terminal 300 on a combined C/S and B/S architecture: the mobile client 100 collects the user's test speech signal and sends it to the server 200, the server 200 evaluates the test speech and returns the evaluation result to the mobile client 100, and the mobile client 100 displays the evaluation result. Users can conveniently access the server 200 over the mobile Internet to obtain services and data; the corpus and evaluation methods can all be kept synchronized by the server 200, which can provide speech analysis processing with better performance and better results. In addition, the spoken pronunciation quality assessment statistics of mobile client users can be obtained in real time from the database of the server 200 through the web management terminal 300, giving third parties (such as instructors) a picture of the users' spoken pronunciation and helping them formulate offline spoken-language guidance and improvement strategies.
The online spoken language pronunciation quality evaluation method and system provided by the embodiments of the present invention can be applied to spoken English learning to assess the quality of spoken English, and can also be applied to pronunciation quality evaluation of other languages, such as Japanese and French.
From the description of the above embodiments, those skilled in the art can clearly understand that the present invention can be implemented by software plus the necessary general-purpose hardware, or of course by dedicated hardware including application-specific integrated circuits, dedicated CPUs, dedicated memories, and dedicated components. In essence, the part of the technical solution of the present invention that contributes beyond the prior art can be embodied in the form of a software product stored in a readable storage medium, such as a computer floppy disk, USB flash drive, portable hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disc.
The above are only specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or substitution that a person familiar with the art can readily conceive within the technical scope disclosed by the present invention shall be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (10)

1. An online spoken language pronunciation quality evaluation method, characterized by comprising:
receiving, over a network, test speech collected by a mobile client;
preprocessing the received test speech;
extracting speech feature parameters from the preprocessed test speech to obtain the feature parameters of the test speech;
evaluating the test speech according to the feature parameters of the test speech and the feature parameters of a standard speech to obtain an evaluation result;
feeding the evaluation result back to the mobile client over the network, and displaying the evaluation result on the mobile client.
2. The online spoken language pronunciation quality evaluation method of claim 1, characterized in that the method further comprises:
storing the evaluation result in a database, and performing statistical analysis on evaluation results to obtain statistics;
sending the statistics to a web management terminal, and displaying the statistics on the web management terminal.
3. The online spoken language pronunciation quality evaluation method of claim 1, characterized in that the method further comprises:
obtaining the standard speech;
preprocessing the standard speech;
extracting speech feature parameters from the preprocessed standard speech to obtain the feature parameters of the standard speech.
4. The online spoken language pronunciation quality evaluation method of any one of claims 1 to 3, characterized in that the preprocessing comprises pre-emphasis, framing, windowing, and endpoint detection.
5. The online spoken language pronunciation quality evaluation method of any one of claims 1 to 3, characterized in that extracting speech feature parameters from the preprocessed test speech to obtain the feature parameters of the test speech comprises:
performing a discrete Fourier transform on the test speech to obtain its spectral coefficients, filtering the spectral coefficient sequence with a bank of triangular filters, taking the logarithm of the filtered data, and applying a discrete cosine transform to obtain the MFCC feature parameters of the test speech;
extracting the fundamental frequency features, short-time energy features, and formant features of the test speech, and combining the fundamental frequency features, short-time energy features, and formant features into the emotion feature parameters of the test speech;
calculating the pronunciation duration of the test speech to obtain the pronunciation duration feature parameter of the test speech;
dividing the test speech into stress units and extracting the start-frame position set and end-frame position set of the stresses to obtain the stress position feature parameters of the test speech;
dividing the test speech into speech units and calculating the duration of each speech unit to obtain the speech unit duration feature parameters of the test speech;
extracting the pitch of each frame of the test speech with the time-domain autocorrelation function method to obtain the pitch feature parameters of the test speech.
6. The online spoken language pronunciation quality evaluation method of claim 5, characterized in that evaluating the test speech according to the feature parameters of the test speech and the feature parameters of the standard speech to obtain an evaluation result comprises:
performing speech recognition on the test speech according to its MFCC feature parameters, using an integrated probabilistic neural network speech recognition model based on segment clustering, to obtain a speech recognition result; computing the similarity between the MFCC feature parameters of the test speech and those of the standard speech to obtain an MFCC correlation coefficient; and computing the accuracy score of the test speech from the speech recognition result and the MFCC correlation coefficient;
performing emotion recognition on the test speech according to its emotion feature parameters, based on an SVM emotion model, to obtain an emotion recognition result; computing the similarity between the emotion feature parameters of the test speech and those of the standard speech to obtain an emotion correlation coefficient; and computing the emotion score of the test speech from the emotion recognition result and the emotion correlation coefficient;
obtaining the speech rate ratio of the standard speech to the test speech from the pronunciation duration feature parameters of the standard speech and the test speech, and computing the speech rate score of the test speech from the speech rate ratio;
comparing the stress position feature parameters of the test speech with those of the standard speech to obtain the stress position differences between the test speech and the standard speech, and computing the stress score of the test speech from the stress position differences;
applying the dPVI algorithm to the speech unit duration feature parameters of the test speech and the standard speech to obtain the dPVI parameter of the test speech, and computing the rhythm score of the test speech from the dPVI parameter;
applying the DTW algorithm to the pitch feature parameters of the test speech and the standard speech to obtain the pitch difference between the standard speech and the test speech, and computing the intonation score of the test speech from the pitch difference.
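One common form of the two comparison measures named in claim 6 (dPVI over speech-unit durations, DTW over pitch contours) is sketched below; the mapping from these raw measures to rhythm and intonation scores is left unspecified by the claims.

import numpy as np

def dpvi(durations):
    # Durational Pairwise Variability Index: mean absolute difference
    # between successive unit durations (assumes at least two units).
    d = np.asarray(durations, dtype=float)
    return np.abs(np.diff(d)).sum() / (len(d) - 1)

def dtw_distance(a, b):
    # Classic accumulated-cost dynamic programming over two contours.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i, j] = step + min(cost[i - 1, j],      # insertion
                                    cost[i, j - 1],      # deletion
                                    cost[i - 1, j - 1])  # match
    return cost[n, m]

A rhythm difference could then be taken as, for example, abs(dpvi(test_durations) - dpvi(standard_durations)), and an intonation difference as dtw_distance(test_pitch, standard_pitch), each mapped onto a score scale.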
7. The online spoken language pronunciation quality evaluation method of claim 6, characterized in that evaluating the test speech according to the feature parameters of the test speech and the feature parameters of the standard speech to obtain an evaluation result further comprises:
computing a weighted sum of the accuracy score, emotion score, speech rate score, stress score, rhythm score and intonation score to obtain a composite score; mapping the accuracy score, emotion score, speech rate score, stress score, rhythm score, intonation score and composite score to grade evaluations according to the mapping between each score and its grade evaluation, to obtain the accuracy, emotion, speech rate, stress, rhythm, intonation and composite grade evaluations of the test speech; and taking the accuracy, emotion, speech rate, stress, rhythm, intonation and composite grade evaluations of the test speech as the evaluation result of the test speech.
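The claim-7 aggregation is a weighted sum followed by a score-to-grade mapping. A minimal sketch, assuming illustrative weights and grade boundaries (the claims specify neither):

WEIGHTS = {'accuracy': 0.30, 'emotion': 0.10, 'rate': 0.15,
           'stress': 0.15, 'rhythm': 0.15, 'intonation': 0.15}

GRADES = [(90, 'excellent'), (75, 'good'), (60, 'fair'), (0, 'poor')]

def grade(score):
    # Map a 0-100 score to the first grade whose cutoff it reaches.
    return next(label for cutoff, label in GRADES if score >= cutoff)

def evaluation_result(scores):
    # scores: dict with the six per-dimension scores keyed as in WEIGHTS.
    result = {k: (v, grade(v)) for k, v in scores.items()}
    composite = sum(WEIGHTS[k] * scores[k] for k in WEIGHTS)
    result['composite'] = (composite, grade(composite))
    return result

For instance, evaluation_result({'accuracy': 88, 'emotion': 70, 'rate': 92, 'stress': 81, 'rhythm': 77, 'intonation': 84}) yields per-dimension (score, grade) pairs plus a composite of 83.5, graded 'good' under these assumed boundaries.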
8. The online spoken language pronunciation quality evaluation method of claim 7, characterized in that the method further comprises:
generating pronunciation guidance for the user's spoken pronunciation according to the evaluation result;
feeding the pronunciation guidance back to the mobile client over the network, and displaying the pronunciation guidance through the mobile client.
9. An online spoken language pronunciation quality evaluation system, characterized in that it comprises a mobile client and a server connected by a network;
the mobile client comprises:
a speech acquisition unit, for collecting test speech and sending the test speech to the server over the network;
the server comprises:
a preprocessing unit, for preprocessing the received test speech;
a feature parameter extraction unit, for extracting speech feature parameters from the preprocessed test speech to obtain the feature parameters of the test speech;
a speech evaluation unit, for evaluating the test speech according to the feature parameters of the test speech and the feature parameters of standard speech to obtain an evaluation result, and for feeding the evaluation result back to the mobile client over the network;
the mobile client further comprises:
a data display unit, for displaying the evaluation result.
10. The online spoken language pronunciation quality evaluation system of claim 9, characterized in that the system further comprises a web management terminal connected to the server by a network, and the server further comprises a database and a statistical analysis unit;
the database is configured to store the evaluation result;
the statistical analysis unit is configured to perform statistical analysis on the evaluation results to obtain statistics, and to send the statistics to the web management terminal;
the web management terminal is configured to display the received statistics.
CN201510102425.8A 2015-03-09 2015-03-09 Online spoken language pronunciation quality evaluation method and system Active CN104732977B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510102425.8A CN104732977B (en) Online spoken language pronunciation quality evaluation method and system

Publications (2)

Publication Number Publication Date
CN104732977A (en) 2015-06-24
CN104732977B CN104732977B (en) 2018-05-11

Family

ID=53456816

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510102425.8A Active CN104732977B (en) Online spoken language pronunciation quality evaluation method and system

Country Status (1)

Country Link
CN (1) CN104732977B (en)

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5867813A (en) * 1995-05-01 1999-02-02 Ascom Infrasys Ag. Method and apparatus for automatically and reproducibly rating the transmission quality of a speech transmission system
WO2005022786A1 (en) * 2003-09-03 2005-03-10 Huawei Technologies Co., Ltd. A method and apparatus for testing voice quality
CN101630448A (en) * 2008-07-15 2010-01-20 上海启态网络科技有限公司 Language learning client and system
CN102054375A (en) * 2009-11-09 2011-05-11 康俊义 Teaching main system for language ability
CN102800314A (en) * 2012-07-17 2012-11-28 广东外语外贸大学 English sentence recognizing and evaluating system with feedback guidance and method of system
CN104050965A (en) * 2013-09-02 2014-09-17 广东外语外贸大学 English phonetic pronunciation quality evaluation system with emotion recognition function and method thereof
CN103617799A (en) * 2013-11-28 2014-03-05 广东外语外贸大学 Method for detecting English statement pronunciation quality suitable for mobile device
CN103928023A (en) * 2014-04-29 2014-07-16 广东外语外贸大学 Voice scoring method and system
CN204117590U (en) * 2014-09-24 2015-01-21 广东外语外贸大学 Voice collecting denoising device and voice quality assessment system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Cao Wenming et al.: "High-Dimensional Information Geometry and Speech Analysis", Science Press, 31 March 2011 *
Zhao Li: "Speech Signal Processing (2nd Edition)", China Machine Press, 30 June 2009 *

Cited By (67)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107924682A (en) * 2015-09-04 2018-04-17 谷歌有限责任公司 Neutral net for speaker verification
CN105205763A (en) * 2015-11-06 2015-12-30 陈国庆 Teaching method and apparatus based on new media modes
CN105513610A (en) * 2015-11-23 2016-04-20 南京工程学院 Voice analysis method and device
CN105488142B (en) * 2015-11-24 2019-07-30 科大讯飞股份有限公司 Performance information input method and system
CN105488142A (en) * 2015-11-24 2016-04-13 科大讯飞股份有限公司 Student score information input method and system
CN105608960A (en) * 2016-01-27 2016-05-25 广东外语外贸大学 Spoken language formative teaching method and system based on multi-parameter analysis
CN105741832A (en) * 2016-01-27 2016-07-06 广东外语外贸大学 Spoken language evaluation method based on deep learning and spoken language evaluation system
CN105741832B (en) * 2016-01-27 2020-01-07 广东外语外贸大学 Spoken language evaluation method and system based on deep learning
CN105825852A (en) * 2016-05-23 2016-08-03 渤海大学 Oral English reading test scoring method
CN106056989A (en) * 2016-06-23 2016-10-26 广东小天才科技有限公司 Language learning method and apparatus, and terminal device
CN106056989B (en) * 2016-06-23 2018-10-16 广东小天才科技有限公司 A kind of interactive learning methods and device, terminal device
CN106205635A (en) * 2016-07-13 2016-12-07 中南大学 Method of speech processing and system
CN106205634A (en) * 2016-07-14 2016-12-07 东北电力大学 A kind of spoken English in college level study and test system and method
CN106328168A (en) * 2016-08-30 2017-01-11 成都普创通信技术股份有限公司 Voice signal similarity detection method
CN106653055A (en) * 2016-10-20 2017-05-10 北京创新伙伴教育科技有限公司 On-line oral English evaluating system
CN108010513A (en) * 2016-10-28 2018-05-08 北京回龙观医院 Method of speech processing and equipment
CN106531182A (en) * 2016-12-16 2017-03-22 上海斐讯数据通信技术有限公司 Language learning system
CN106782609A (en) * 2016-12-20 2017-05-31 杨白宇 A kind of spoken comparison method
CN107067834A (en) * 2017-03-17 2017-08-18 麦片科技(深圳)有限公司 Point-of-reading system with oral evaluation function
CN108806720B (en) * 2017-05-05 2019-12-06 京东方科技集团股份有限公司 Microphone, data processor, monitoring system and monitoring method
US10499149B2 (en) 2017-05-05 2019-12-03 Boe Technology Group Co., Ltd. Microphone, vocal training apparatus comprising microphone and vocal analyzer, vocal training method, and non-transitory tangible computer-readable storage medium
WO2018201688A1 (en) * 2017-05-05 2018-11-08 Boe Technology Group Co., Ltd. Microphone, vocal training apparatus comprising microphone and vocal analyzer, vocal training method, and non-transitory tangible computer-readable storage medium
CN108806720A (en) * 2017-05-05 2018-11-13 京东方科技集团股份有限公司 Microphone, data processor, monitoring system and monitoring method
US20190124441A1 (en) * 2017-05-05 2019-04-25 Boe Technology Group Co., Ltd. Microphone, vocal training apparatus comprising microphone and vocal analyzer, vocal training method, and non-transitory tangible computer-readable storage medium
CN107221318B (en) * 2017-05-12 2020-03-31 广东外语外贸大学 English spoken language pronunciation scoring method and system
CN107221318A (en) * 2017-05-12 2017-09-29 广东外语外贸大学 Oral English Practice pronunciation methods of marking and system
CN107316638A (en) * 2017-06-28 2017-11-03 北京粉笔未来科技有限公司 A kind of poem recites evaluating method and system, a kind of terminal and storage medium
CN107342079A (en) * 2017-07-05 2017-11-10 谌勋 A kind of acquisition system of the true voice based on internet
WO2019075828A1 (en) * 2017-10-20 2019-04-25 深圳市鹰硕音频科技有限公司 Voice evaluation method and apparatus
CN107818795A (en) * 2017-11-15 2018-03-20 苏州驰声信息科技有限公司 The assessment method and device of a kind of Oral English Practice
CN110010123A (en) * 2018-01-16 2019-07-12 上海异构网络科技有限公司 English phonetic word pronunciation learning evaluation system and method
CN108417224B (en) * 2018-01-19 2020-09-01 苏州思必驰信息科技有限公司 Training and recognition method and system of bidirectional neural network model
CN108417224A (en) * 2018-01-19 2018-08-17 苏州思必驰信息科技有限公司 The training and recognition methods of two way blocks model and system
CN108322791A (en) * 2018-02-09 2018-07-24 咪咕数字传媒有限公司 A kind of speech evaluating method and device
CN108322791B (en) * 2018-02-09 2021-08-24 咪咕数字传媒有限公司 Voice evaluation method and device
CN108429932A (en) * 2018-04-25 2018-08-21 北京比特智学科技有限公司 Method for processing video frequency and device
CN108985196A (en) * 2018-06-29 2018-12-11 深圳市华讯方舟太赫兹科技有限公司 Evaluate method, terminal and the device with store function of the accuracy of safety check instrument
CN108922289A (en) * 2018-07-25 2018-11-30 深圳市异度信息产业有限公司 A kind of scoring method, device and equipment for Oral English Practice
CN110148427A (en) * 2018-08-22 2019-08-20 腾讯数码(天津)有限公司 Audio-frequency processing method, device, system, storage medium, terminal and server
CN110148427B (en) * 2018-08-22 2024-04-19 腾讯数码(天津)有限公司 Audio processing method, device, system, storage medium, terminal and server
CN109191349A (en) * 2018-11-02 2019-01-11 北京唯佳未来教育科技有限公司 A kind of methods of exhibiting and system of English learning content
CN109300339A (en) * 2018-11-19 2019-02-01 王泓懿 A kind of exercising method and system of Oral English Practice
CN109658921A (en) * 2019-01-04 2019-04-19 平安科技(深圳)有限公司 A kind of audio signal processing method, equipment and computer readable storage medium
CN109817244A (en) * 2019-02-26 2019-05-28 腾讯科技(深圳)有限公司 Oral evaluation method, apparatus, equipment and storage medium
CN109859548A (en) * 2019-04-15 2019-06-07 新乡学院 A kind of university's mandarin teaching aid equipment
CN110298537A (en) * 2019-05-21 2019-10-01 威比网络科技(上海)有限公司 Network classroom method for building up, system, equipment and storage medium based on exchange
CN110853679A (en) * 2019-10-23 2020-02-28 百度在线网络技术(北京)有限公司 Speech synthesis evaluation method and device, electronic equipment and readable storage medium
CN110853679B (en) * 2019-10-23 2022-06-28 百度在线网络技术(北京)有限公司 Speech synthesis evaluation method and device, electronic equipment and readable storage medium
CN111583961A (en) * 2020-05-07 2020-08-25 北京一起教育信息咨询有限责任公司 Stress evaluation method and device and electronic equipment
CN111653292A (en) * 2020-06-22 2020-09-11 桂林电子科技大学 English reading quality analysis method for Chinese students
CN111653292B (en) * 2020-06-22 2023-03-31 桂林电子科技大学 English reading quality analysis method for Chinese students
CN111916106A (en) * 2020-08-17 2020-11-10 牡丹江医学院 Method for improving pronunciation quality in English teaching
CN112017694A (en) * 2020-08-25 2020-12-01 天津洪恩完美未来教育科技有限公司 Voice data evaluation method and device, storage medium and electronic device
CN112017694B (en) * 2020-08-25 2021-08-20 天津洪恩完美未来教育科技有限公司 Voice data evaluation method and device, storage medium and electronic device
CN112466335A (en) * 2020-11-04 2021-03-09 吉林体育学院 English pronunciation quality evaluation method based on accent prominence
CN112466335B (en) * 2020-11-04 2023-09-29 吉林体育学院 English pronunciation quality evaluation method based on accent prominence
TWI763207B (en) * 2020-12-25 2022-05-01 宏碁股份有限公司 Method and apparatus for audio signal processing evaluation
US11636844B2 (en) 2020-12-25 2023-04-25 Acer Incorporated Method and apparatus for audio signal processing evaluation
CN112863263A (en) * 2021-01-18 2021-05-28 吉林农业科技学院 Korean pronunciation correction system based on big data mining technology
CN112967711A (en) * 2021-02-02 2021-06-15 早道(大连)教育科技有限公司 Spoken language pronunciation evaluation method, spoken language pronunciation evaluation system and storage medium for small languages
CN112967711B (en) * 2021-02-02 2022-04-01 早道(大连)教育科技有限公司 Spoken language pronunciation evaluation method, spoken language pronunciation evaluation system and storage medium for small languages
CN112908360A (en) * 2021-02-02 2021-06-04 早道(大连)教育科技有限公司 Online spoken language pronunciation evaluation method and device and storage medium
CN112767961A (en) * 2021-02-07 2021-05-07 哈尔滨琦音科技有限公司 Mouth sound correction method based on cloud computing
WO2022169417A1 (en) * 2021-02-07 2022-08-11 脸萌有限公司 Speech similarity determination method, device and program product
CN112767961B (en) * 2021-02-07 2022-06-03 哈尔滨琦音科技有限公司 Accent correction method based on cloud computing
CN113192494A (en) * 2021-04-15 2021-07-30 辽宁石油化工大学 Intelligent English language identification and output system and method
CN113160852A (en) * 2021-04-16 2021-07-23 平安科技(深圳)有限公司 Voice emotion recognition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN104732977B (en) 2018-05-11

Similar Documents

Publication Publication Date Title
CN104732977A (en) On-line spoken language pronunciation quality evaluation method and system
CN104200804B (en) Various-information coupling emotion recognition method for human-computer interaction
CN103617799B (en) A kind of English statement pronunciation quality detection method being adapted to mobile device
CN103928023B (en) A kind of speech assessment method and system
Yeh et al. Segment-based emotion recognition from continuous Mandarin Chinese speech
CN105825852A (en) Oral English reading test scoring method
US11335324B2 (en) Synthesized data augmentation using voice conversion and speech recognition models
Torres-Boza et al. Hierarchical sparse coding framework for speech emotion recognition
CN103366759A (en) Speech data evaluation method and speech data evaluation device
Chamoli et al. Detection of emotion in analysis of speech using linear predictive coding techniques (LPC)
Wu et al. The DKU-LENOVO Systems for the INTERSPEECH 2019 Computational Paralinguistic Challenge.
Safavi et al. Identification of gender from children's speech by computers and humans.
CN113111151A (en) Cross-modal depression detection method based on intelligent voice question answering
Yusnita et al. Analysis of accent-sensitive words in multi-resolution mel-frequency cepstral coefficients for classification of accents in Malaysian English
Selouani et al. Native and non-native class discrimination using speech rhythm-and auditory-based cues
CN111210845B (en) Pathological voice detection device based on improved autocorrelation characteristics
Jiang et al. Performance evaluation of deep bottleneck features for spoken language identification
Jacob et al. Prosodic feature based speech emotion recognition at segmental and supra segmental levels
Iriondo et al. Objective and subjective evaluation of an expressive speech corpus
Yousfi et al. Isolated Iqlab checking rules based on speech recognition system
CN113129923A (en) Multi-dimensional singing playing analysis evaluation method and system in art evaluation
Bansod et al. Speaker Recognition using Marathi (Varhadi) Language
Anila et al. Emotion recognition using continuous density HMM
Safavi et al. Comparison of two scoring method within i-vector framework for speaker recognition from children’s speech
Roh et al. Novel acoustic features for speech emotion recognition

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant